Case 1:14-cv-00857-TSC Document 60-85 Filed 12/21/15 Page 1 of 100


EXHIBIT TTT-1 

Case No. 1:14-cv-00857-TSC-DAR





AERA APA NCME 0000001 





STANDARDS 

for educational and psychological testing 


American Educational Research Association 
American Psychological Association 
National Council on Measurement in Education 




Copyright © 1999 by the American Educational 
Research Association, the American Psychological 
Association, and the National Council on 
Measurement in Education. All rights reserved.
Except as permitted under the United States 
Copyright Act of 1976, no part of this publication 
may be reproduced or distributed in any form or 
by any means, or stored in a database or retrieval
system, without the prior written permission of 
the publisher. 

Published by 

American Educational Research Association 
1430 K St, NW, Suite 1200 
Washington, DC 20005 

Library of Congress Card number: 99066845 
ISBN: 0-935302-25-5 
ISBN-13: 978-0-935302-25-7 

Printed in the United States of America 
First printing in 1999, second, 2002; third, 2004; 
fourth, 2007; fifth, 2008; and sixth, 2011. 

The Standards for Educational and Psychological 
Testing will be under continuing review by the 
three sponsoring organizations. Comments and 
suggestions will be welcome and should be sent to 
The Committee to Develop Standards for 
Educational and Psychological Testing in care of 
the Executive Office, American Psychological 
Association, 750 First Street, NE, Washington,
DC 20002-4242.

Prepared by the 

Joint Committee on Standards for Educational 
and Psychological Testing of the American 
Educational Research Association, the American 
Psychological Association, and the National 
Council on Measurement in Education. 




TABLE OF CONTENTS 

PREFACE v 

INTRODUCTION 1

Participants in the Testing Process 1 

The Purpose of the Standards 2 

Categories of Standards 2 

Tests and Test Uses to Which These Standards Apply 3 

Cautions to be Exercised in Using the Standards 4 

The Number of Standards 4 

Tests as Measures of Constructs 5 

Organization of This Volume 5 

PART I 

TEST CONSTRUCTION, EVALUATION, AND DOCUMENTATION 7 

1. Validity 9 

Background 9 

Standards 1.1-1.24 17 

2. Reliability and Errors of Measurement 25 

Background 25 

Standards 2.1-2.20 31 

3. Test Development and Revision 37 

Background 37 

Standards 3.1-3.27 43 

4. Scales, Norms, and Score Comparability 49 

Background 49 

Standards 4.1-4.21 54 

5. Test Administration, Scoring, and Reporting 61 

Background 61 

Standards 5.1-5.16 63 

6. Supporting Documentation for Tests 67 

Background 67 

Standards 6.1-6.15 68 

PART II 

FAIRNESS IN TESTING 71 

7. Fairness in Testing and Test Use 73 

Background 73 

Standards 7.1-7.12 80 



8. The Rights and Responsibilities of Test Takers 85 

Background 85 

Standards 8.1-8.13 86 

9. Testing Individuals of Diverse Linguistic Backgrounds 91 

Background 91 

Standards 9.1-9.11 97

10. Testing Individuals with Disabilities 101 

Background 101 

Standards 10.1-10.12 106 

PART III 

TESTING APPLICATIONS 109 

11. The Responsibilities of Test Users 111

Background 111

Standards 11.1-11.24 113 

12. Psychological Testing and Assessment 119

Background 119

Standards 12.1-12.20 131 

13. Educational Testing and Assessment 137 

Background 137 

Standards 13.1-13.19 145 

14. Testing in Employment and Credentialing 151 

Background 151

Standards 14.1-14.17 158 

15. Testing in Program Evaluation and Public Policy 163 

Background 163 

Standards 15.1-15.13 167 

GLOSSARY 171

INDEX 185 





PREFACE


There have been five earlier documents from 
three sponsoring organizations guiding the 
development and use of tests. The first of these 
was Technical Recommendations for Psychological 
Tests and Diagnostic Techniques, prepared by 
a committee of the American Psychological 
Association (APA) and published by that 
organization in 1954. The second was Technical 
Recommendations for Achievement Tests, prepared 
by a committee representing the American 
Educational Research Association (AERA) 
and the National Council on Measurement 
Used in Education (NCMUE) and published 
by the National Education Association in 
1955. The third, which replaced the earlier 
two, was published by APA in 1966 and 
prepared by a committee representing APA, 
AERA, and the National Council on 
Measurement in Education (NCME) and 
called the Standards for Educational and 
Psychological Tests and Manuals. The fourth, 
Standards for Educational and Psychological 
Tests, was again a collaboration of AERA, APA 
and NCME, and was published in 1974. The 
fifth, Standards for Educational and Psychological 
Testing, also a joint collaboration, was pub- 
lished in 1985. 

In 1991 APA’s Committee on Psycholo- 
gical Tests and Assessment suggested the need 
to revise the 1985 Standards. Representatives 
of AERA, APA and NCME met and discussed 
the revision, principles that should guide 
that revision, and potential Joint Committee 
members. By 1993, the presidents of the 
three organizations appointed members 
and the Committee had its first meeting in
November 1993.

The Standards has been developed by a 
joint committee appointed by AERA, APA and 
NCME. Members of the Committee were: 

Eva Baker, co-chair
Paul Sackett, co-chair
Lloyd Bond
Leonard Feldt
David Goh
Bert Green
Edward Haertel
Jo-Ida Hansen
Sharon Johnson-Lewis
Suzanne Lane
Joseph Matarazzo
Manfred Meier
Pamela Moss
Esteban Olmedo
Diana Pullin

From 1993 to 1996 Charles Spielberger 
served on the Committee as co-chair. Each 
sponsoring organization was permitted 
to assign up to two liaisons to the Joint 
Committee's project. Liaisons served as the
conduits between the sponsoring organiza- 
tions and the Joint Committee. APA’s liaison 
from its Committee on Psychological Tests 
and Assessments changed several times as the 
membership of the Committee changed. 

Liaisons to the Joint Committee: 

AERA - William Mehrens
APA - Bruce Bracken, Andrew Czopek, 
Rodney Lowman, Thomas Oakland 
NCME - Daniel Eignor 

APA and NCME also had committees 
who served to monitor the process and keep 
relevant parties informed. 

APA Ad Hoc Committee of the Council of 
Representatives: 

Melba Vasquez 
Donald Bersoff 
Stephen DeMers 
James Farr 
Bertram Karon 
Nadine Lambert 
Charles Spielberger 

NCME Standards and Test Use Committee: 

Gregory Cizek 
Allen Doolittle 
Le Ann Gamache 
Donald Ross Green
Ellen Julian 
Tracy Muenz 
Nambury Raju 

A management committee was formed at 
the beginning of this effort. They monitored 
the financial and administrative arrangements 
of the project, and advised the sponsoring 
organizations on such matters. 

Management Committee: 

Frank Farley, APA 
George Madaus, AERA 
Wendy Yen, NCME 

Staffing for the revision included Dianne 
Brown Maranto as project director, and 
Dianne L. Schneider as staff liaison. Wayne J. 
Camara served as project director from 1993 to 
1994. APA’s legal counsel conducted the legal 
review of the Standards. William C. Howell
and William Mehrens reviewed the standards 
for consistency across chapters. Linda Murphy 
developed the indexing for the book. 

The Joint Committee solicited preliminary
reviews of some draft chapters from recognized
experts. These reviews were primarily
solicited for the technical and fairness chapters.
Reviewers are listed below:

Marvin Alkin 
Philip Bashook 
Bruce Bloxom 
Jeffery P. Braden 
Robert L. Brennan 
John Callender 
Ronald Cannella 
Lee J. Cronbach 
James Cummins 
John Fremer 
Kurt F. Geisinger 
Robert M. Guion 
Walter Haney 
Patti L. Harrison 
Gerald P. Koocher 
Richard Jeanneret 


Frank Landy 
Ellen Lent 
Robert Linn 
Theresa C. Liu 
Stanford von Mayrhauser 
Milbrey W. McLaughlin 
Samuel Messick 
Craig N. Mills 
Robert J. Mislevy 
Kevin R. Murphy 
Mary Anne Nester 
Maria Pennock-Roman 
Carole Perlman 
Michael Rosenfeld 
Jonathan Sandoval 
Cynthia B. Schmeiser 
Kara Schmitt 
Neal Schmitt 
Richard J. Shavelson 
Lorrie A. Shepard 
Mark E. Swerdlik 
Janet Wall 
Anthony R. Zara 

Draft versions of the Standards were 
widely distributed for public review and 
comment three times during this revision 
effort, providing the Committee with a 
total of nearly 8,000 pages of comments. 
Organizations who submitted comments on 
drafts are listed below. Many individuals 
contributed to the input from each organi- 
zation, and although we wish we could 
acknowledge every individual who had input, 
we cannot do so due to incomplete informa- 
tion as to who contributed to each organiza- 
tion’s response. The Joint Committee could 
not have completed its task without the 
thoughtful reviews of so many professionals. 

Sponsoring Associations 

American Educational Research 
Association (AERA) 

American Psychological Association (APA) 
National Council on Measurement in 
Education (NCME) 

Membership Organizations (Scientific,
Professional, Trade & Advocacy) 

American Association for Higher 
Education (AAHE) 

American Board of Medical Specialties 
(ABMS) 

American Counseling Association (ACA) 
American Evaluation Association (AEA) 
American Occupational Therapy 
Association 

American Psychological Society (APS) 
APA Division of Counseling Psychology 
(Division 17) 

APA Division of Developmental 
Psychology (Division 7) 

APA Division of Evaluation, Measurement, 
and Statistics (Division 5) 

APA Division of Mental Retardation & 
Developmental Disabilities (Division 33) 
APA Division of Pharmacology & 
Substance Abuse (Division 28) 

APA Division of Rehabilitation 
Psychology (Division 22) 

APA Division of School Psychology 
(Division 16) 

Asian American Psychological 
Association (AAPA) 

Association for Assessment in 
Counseling (AAC) 

Association of Test Publishers (ATP) 
Australian Council for Educational 
Research Limited (ACER) 

Chicago Industrial/Organizational 
Psychologists (CIOP) 

Council on Licensure, Enforcement, and 
Regulation (CLEAR), Examination 
Resources & Advisory Committee 
(ERAC) 

Equal Employment Advisory Council
(EEAC)

Foundation for Rehabilitation
Certification, Education and Research

Human Sciences Research Council,
South Africa

International Association for Cross- 
Cultural Psychology (IACCP) 


International Brotherhood of Electrical 
Workers 

International Language Testing Association 
International Personnel Management 
Association Assessment Council 
(IPMAAC) 

Joint Committee on Testing Practices 
(JCTP) 

National Association for the Advancement 
of Colored People (NAACP), Legal 
Defense and Educational Fund, Inc. 
National Center for Fair and Open 
Testing (Fairtest) 

National Organization for Competency 
Assurance (NOCA) 

Personnel Testing Council of Metropolitan 
Washington (PTC/MW) 

Personnel Testing Council of Southern 
California (PTC/SC) 

Society for Human Resource Management 
(SHRM) 

Society of Indian Psychologists (SIP) 
Society for Industrial and Organizational 
Psychology (APA Division 14) 

Society for the Psychological Study 
of Ethnic Minority Issues (APA 
Division 45) 

State Collaborative on Assessment & 
Student Standards Technical Guidelines 
for Performance Assessment 
Consortium (TGPA) 
Telecommunications Staffing Forum 
Western Region Intergovernmental 
Personnel Assessment Council 
(WRIPAC) 

Credentialing Boards 

American Board of Physical and Medical 
Rehabilitation 

American Medical Technologists 
Commission on Rehabilitation 
Counselor Certification 
National Board for Certified Counselors 
(NBCC) 

National Board of Examiners in 
Optometry 


National Board of Medical Examiners
National Council of State Boards of 
Nursing 

Government and Federal Agencies 

Army Research Institute (ARI) 

California Highway Patrol, Personnel and 
Training Division, Selection Research 
Program 

City of Dallas, Civil Service Department
Commonwealth of Virginia, Department 
of Education 

Defense Manpower Data Center
(DMDC), Personnel Testing Division
Department of Defense (DOD), Office 
of the Assistant Secretary of Defense 
Department of Education, Office of 
Educational Improvement, National 
Center for Education Statistics 
Department of Justice, Immigration and 
Naturalization Service (INS) 
Department of Labor, Employment and 
Training Administration (DOL/ETA) 
U.S. Equal Employment Opportunity 
Commission (EEOC) 

U.S. Office of Personnel Management 
(OPM), Personnel Resources & 
Development Center 

Test Publishers/Developers 

American College Testing (ACT) 
CTB/McGraw-Hill 
The College Board 
Educational Testing Service (ETS) 
Highland Publishing Company 
Institute for Personality & Ability 
Testing (IPAT) 

Professional Examination Service (PES) 

Academic Institutions 

Center for Creative Leadership 
Gallaudet University, National Task 
Force on Equity in Testing Deaf 
Professionals 

University of Haifa, Israeli Group 
Kansas State University 
National Center on Educational 
Outcomes (NCEO) 


Pennsylvania State University

University of North Carolina - Charlotte 
University of Southern Mississippi, 
Department of Psychology 

When the Joint Committee completed 
its task of revising the Standards, it then 
submitted its work to the three sponsoring 
organizations for approval. Each organization 
had its own governing body and mechanism 
for approval, as well as definitions for what 
their approval means. 

AERA: This endorsement carries with it 
the understanding that, in general, we 
believe the Standards to represent the 
current consensus among recognized 
professionals regarding expected meas- 
urement practice. Developers, sponsors, 
publishers, and users of tests should 
observe these Standards. 

APA: The APA’s approval of the 
Standards means the Council adopts
the document as APA policy. 

NCME: NCME endorses the Standards
for Educational and Psychological Testing 
and recognizes that the intent of these 
Standards is to promote sound and 
responsible measurement practice. This 
endorsement carries with it a profes- 
sional imperative for NCME members 
to attend to the Standards. 

Although the Standards are prescriptive, the 
Standards itself does not contain enforcement 
mechanisms. These standards were formulated 
with the intent of being consistent with other 
standards, guidelines and codes of conduct 
published by the three sponsoring organizations, 
and listed below. The reader is encouraged to 
obtain these documents, some of which have 
references to testing and assessment in specific 
applications or settings. 

The Joint Committee on the 
Standards for Educational and 
Psychological Testing 

INTRODUCTION


Educational and psychological testing and 
assessment are among the most important 
contributions of behavioral science to our 
society, providing fundamental and signifi- 
cant improvements over previous practices. 
Although not all tests are well-developed nor 
are all testing practices wise and beneficial, 
there is extensive evidence documenting the 
effectiveness of well-constructed tests for uses 
supported by validity evidence. The proper 
use of tests can result in wiser decisions about 
individuals and programs than would be the 
case without their use and also can provide a 
route to broader and more equitable access to 
education and employment. The improper 
use of tests, however, can cause considerable 
harm to test takers and other parties affected 
by test-based decisions. The intent of the 
Standards is to promote the sound and ethical 
use of tests and to provide a basis for evaluat- 
ing the quality of testing practices. 

Participants in the Testing Process 

Educational and psychological testing and 
assessment involve and significantly affect 
individuals, institutions, and society as a 
whole. The individuals affected include stu- 
dents, parents, teachers, educational adminis- 
trators, job applicants, employees, clients, 
patients, supervisors, executives, and evalua- 
tors, among others. The institutions affected 
include schools, colleges, businesses, industry, 
clinics, and government agencies. Individuals 
and institutions benefit when testing helps them 
achieve their goals. Society, in turn, benefits 
when testing contributes to the achievement 
of individual and institutional goals. 

The interests of the various parties 
involved in the testing process are usually, 
but not always, congruent. For example, 
when a test is given for counseling purposes 
or for job placement, the interests of the 
individual and the institution often coincide.
In contrast, when a test is used to
select from among many individuals for a
highly competitive job or for entry into an 
educational or training program, the prefer- 
ences of an applicant may be inconsistent 
with those of an employer or admissions 
officer. Similarly, when testing is mandated 
by a court, the interests of the test taker may 
be different from those of the party requesting 
the court order. 

There are many participants in the testing 
process, including, among others: (a) those who 
prepare and develop the test; (b) those who 
publish and market the test; (c) those who 
administer and score the test; (d) those who 
use the test results for some decision-making 
purpose; (e) those who interpret test results for 
clients; (f) those who take the test by choice, 
direction, or necessity; (g) those who sponsor 
tests, which may be boards that represent 
institutions or governmental agencies that 
contract with a test developer for a specific 
instrument or service; and (h) those who select 
or review tests, evaluating their comparative 
merits or suitability for the uses proposed. 

These roles are sometimes combined and 
sometimes further divided. For example, in 
clinics the test taker is typically the intended 
beneficiary of the test results. In some situa- 
tions the test administrator is an agent of the 
test developer, and sometimes the test admin- 
istrator is also the test user. When an industrial 
organization prepares its own employment 
tests, it is both the developer and the user. 
Sometimes a test is developed by a test author 
but published, advertised, and distributed by 
an independent publisher, though the publisher 
may play an active role in the test development. 
Given this intermingling of roles, it is difficult 
to assign precise responsibility for addressing 
various standards to specific participants in 
the testing process. 

This document begins with a series of 
chapters on the test development process, 
which focus primarily on the responsibilities 
of test developers, and then turns to chapters
on specific uses and applications, which focus
primarily on responsibilities of test users. One 
chapter is devoted specifically to the rights 
and responsibilities of test takers. 

The Standards is based on the premise 
that effective testing and assessment require 
that all participants in the testing process pos- 
sess the knowledge, skills, and abilities rele- 
vant to their role in the testing process, as 
well as awareness of personal and contextual 
factors that may influence the testing process. 
They also should obtain any appropriate 
supervised experience and legislatively man- 
dated practice credentials necessary to perform 
competently those aspects of the testing 
process in which they engage. For example, 
test developers and those selecting and 
interpreting tests need adequate knowledge 
of psychometric principles such as validity 
and reliability. 

The Purpose of the Standards 

The purpose of publishing the Standards is 
to provide criteria for the evaluation of tests, 
testing practices, and the effects of test use. 
Although the evaluation of the appropriate- 
ness of a test or testing application should 
depend heavily on professional judgment, the 
Standards provides a frame of reference to 
assure that relevant issues are addressed. It is 
hoped that all professional test developers, 
sponsors, publishers, and users will adopt the 
Standards and encourage others to do so. 

The Standards makes no attempt to pro- 
vide psychometric answers to questions of 
public policy regarding the use of tests. In 
general, the Standards advocates that, within 
feasible limits, the relevant technical informa- 
tion be made available so that those involved 
in policy debate may be fully informed. 

Categories of Standards 

The 1985 Standards designated each standard
as “primary” (to be met by all tests before
operational use), “secondary” (desirable, but
not feasible in certain situations), or
“conditional” (importance varies with application).
The present Standards continues the tradition 
of expecting test developers and users to con- 
sider all standards before operational use; 
however, the Standards does not continue the 
practice of designating levels of importance. 
Instead, the text of each standard, and any 
accompanying commentary, discusses the
conditions under which a standard is relevant. 
It was not the case that under the 1985 
Standards test developers and users were obli- 
gated to attend only to the primary standards. 
Rather, the term “conditional” meant that a 
standard was primary in some settings and 
secondary in others, thus requiring careful 
consideration of the applicability of each stan- 
dard for a given setting. 

The absence of designations such as 
“primary” or “conditional” should not be
taken to imply that all standards are equally 
significant in any given situation. Depending 
on the context and purpose of test develop- 
ment or use, some standards will be more 
salient than others. Moreover, some standards 
are broad in scope, setting forth concerns or 
requirements relevant to nearly all tests or 
testing contexts, and other standards are nar- 
rower in scope. However, all standards are 
important in the contexts to which they 
apply. Any classification that gives the appear- 
ance of elevating the general importance of 
some standards over others could invite neglect 
of some standards that need to be addressed 
in particular situations. 

Further, the current Standards does not 
include standards considered secondary or 
“desirable.” The continued use of the second- 
ary designation would risk encouraging both 
the expansion of the Standards to encompass 
large numbers of “desirable” standards and 
the inappropriate assumption that any guideline
not included in the Standards as at least
“secondary” was inconsequential. 

Unless otherwise specified in the stan- 
dard or commentary, and with the caveats 
outlined below, standards should be met
before operational test use. This means that 
each standard should be carefully considered 
to determine its applicability to the testing 
context under consideration. In a given case 
there may be a sound professional reason why 
adherence to the standard is unnecessary. It is 
also possible that there may be occasions 
when technical feasibility may influence 
whether a standard can be met prior to 
operational test use. For example, some 
standards may call for analyses of data that
may not be available at the point of initial 
operational test use. If test developers, users, 
and, when applicable, sponsors have deemed 
a standard to be inapplicable or unfeasible, 
they should be able, if called upon, to explain 
the basis for their decision. However, there 
is no expectation that documentation be 
routinely available of the decisions related 
to each standard. 

Tests and Test Uses to 
Which These Standards Apply 

A test is an evaluative device or procedure in 
which a sample of an examinee’s behavior in a 
specified domain is obtained and subsequent- 
ly evaluated and scored using a standardized 
process. While the label test is ordinarily 
reserved for instruments on which responses 
are evaluated for their correctness or quality 
and the terms scale or inventory are used for
measures of attitudes, interest, and disposi- 
tions, the Standards uses the single term test 
to refer to all such evaluative devices. 

A distinction is sometimes made between 
test and assessment. Assessment is a broader 
term, commonly referring to a process that 
integrates test information with information 
from other sources (e.g., information from 
the individual’s social, educational, employ- 
ment, or psychological history). The applica- 
bility of the Standards to an evaluation device 
or method is not altered by the label applied 
to it (e.g., test, assessment, scale, inventory).


Tests differ on a number of dimensions: 
the mode in which test materials are present- 
ed (paper and pencil, oral, computerized 
administration, and so on); the degree to 
which stimulus materials are standardized; 
the type of response format (selection of a 
response from a set of alternatives as opposed 
to the production of a response); and the 
degree to which test materials are designed to 
reflect or simulate a particular context. In all 
cases, however, tests standardize the process 
by which test-taker responses to test materials 
are evaluated and scored. As noted in prior 
versions of the Standards, the same general
types of information are needed for all vari- 
eties of tests. 

The precise demarcation between those 
measurement devices used in the fields of 
educational and psychological testing that do 
and do not fall within the purview of the 
Standards is difficult to identify. Although the 
Standards applies most directly to standard- 
ized measures generally recognized as “tests,” 
such as measures of ability, aptitude, achieve- 
ment, attitudes, interests, personality, cogni- 
tive functioning, and mental health, it may 
also be usefully applied in varying degrees to 
a broad range of less formal assessment tech- 
niques. Admittedly, it will generally not be 
possible to apply the Standards rigorously to 
unstandardized questionnaires or to the broad 
range of unstructured behavior samples used 
in some forms of clinic- and school-based 
psychological assessment (e.g., an intake inter- 
view), and to instructor-made tests that are 
used to evaluate student performance in edu- 
cation and training. It is useful to distinguish 
between devices that lay claim to the concepts 
and techniques of the field of educational and 
psychological testing from those which repre- 
sent nonstandardized or less standardized aids 
to day-to-day evaluative decisions. Although 
the principles and concepts underlying the 
Standards can be fruitfully applied to day-to- 
day decisions, such as when a business owner 
interviews a job applicant, a manager evaluates
the performance of subordinates, or a
coach evaluates a prospective athlete, it would 
be overreaching to expect that the standards 
of the educational and psychological testing 
field be followed by those making such deci- 
sions. In contrast, a structured interviewing 
system developed by a psychologist and 
accompanied by claims that the system has 
been found to be predictive of job perform- 
ance in a variety of other settings falls within 
the purview of the Standards. 

Cautions to be Exercised in Using 
the Standards 

Several cautions are important to avoid
misinterpreting the Standards:

1) Evaluating the acceptability of a test 
or test application does not rest on the literal 
satisfaction of every standard in this docu- 
ment, and acceptability cannot be determined 
by using a checklist. Specific circumstances 
affect the importance of individual standards, 
and individual standards should not be con- 
sidered in isolation. Therefore, evaluating 
acceptability involves (a) professional judgment 
that is based on a knowledge of behavioral sci- 
ence, psychometrics, and the community 
standards in the professional field to which 
the tests apply; (b) the degree to which the 
intent of the standard has been satisfied by 
the test developer and user; (c) the alternatives 
that are readily available; and (d) research and 
experiential evidence regarding feasibility of 
meeting the standard. 

2) When tests are at issue in legal pro- 
ceedings and other venues requiring expert 
witness testimony it is essential that profes- 
sional judgment be based on the accepted 
corpus of knowledge in determining the rele- 
vance of particular standards in a given situa- 
tion. The intent of the Standards is to offer 
guidance for such judgments. 

3) Claims by test developers or test users 
that a test, manual, or procedure satisfies or 
follows these standards should be made with
care. It is appropriate for developers or users
to state that efforts were made to adhere to
the Standards, and to provide documents
describing and supporting those efforts. 
Blanket claims without supporting evidence 
should not be made. 

4) These standards are concerned with a 
field that is evolving. Consequently, there is 
a continuing need to monitor changes in the 
field and to revise this document as knowl- 
edge develops. 

5) Prescription of the use of specific 
technical methods is not the intent of the 
Standards. For example, where specific statis- 
tical reporting requirements are mentioned, 
the phrase “or generally accepted equivalent” 
always should be understood. 

The standards do not attempt to repeat 
or to incorporate the many legal or regulatory 
requirements that might be relevant to the 
issues they address. In some areas, such as the 
collection, analysis, and use of test data and
results for different subgroups, the law may 
both require participants in the testing process 
to take certain actions and prohibit those 
participants from taking other actions. Where 
it is apparent that one or more standards or 
comments address an issue on which estab- 
lished legal requirements may be particularly 
relevant, the standard, comment, or introduc- 
tory material may make note of that fact. 
Lack of specific reference to legal require- 
ments, however, does not imply that no rele- 
vant requirement exists. In all situations, 
participants in the testing process should 
separately consider and, where appropriate, 
obtain legal advice on legal and regulatory 
requirements. 

The Number of Standards 

The number of standards has increased from 
the 1985 Standards for a variety of reasons. 
First, and most importantly, new develop- 
ments have led to the addition of new stan- 
dards. Commonly these deal with new types 
of tests or new uses for existing tests, rather
than being broad standards applicable to all 
tests. Second, on the basis of recognition that 
some users of the Standards may turn only to 
chapters directly relevant to a given applica- 
tion, certain standards are repeated in differ- 
ent chapters. When such repetition occurs, 
the essence of the standard is the same. Only 
the wording, area of application, or elabora- 
tion in the comment is changed. Third, 
standards dealing with important nontechni- 
cal issues, such as avoiding conflicts of inter- 
est and equitable treatment of all test takers, 
have been added. Although such topics have 
not been addressed in prior versions of the 
Standards, they are not likely to be viewed as 
imposing burdensome new requirements. 
Thus the increase in the number of standards
does not per se signal an increase in
the obligations placed on test developers 
and test users. 

Tests as Measures of Constructs 

We depart from some historical uses of the 
term “construct,” which reserve the term for 
characteristics that are not directly observable, 
but which are inferred from interrelated sets 
of observations. This historical perspective 
invites confusion. Some tests are viewed as 
measures of constructs, while others are not. 
In addition, considerable debate has ensued 
as to whether certain characteristics measured 
by tests are properly viewed as constructs. 
Furthermore, the types of validity evidence 
thought to be suitable can differ as a result 
of whether a given test is viewed as measur- 
ing a construct. 

We use the term construct more broadly 
as the concept or characteristic that a test is 
designed to measure. Rarely, if ever, is there a 
single possible meaning that can be attached 
to a test score or a pattern of test responses. 
Thus, it is always incumbent on a testing 
professional to specify the construct interpretation
that will be made on the basis of the
score or response pattern. The notion that
some tests are not under the purview of the 
Standards because they do not measure con- 
structs is contrary to this use of the term. 
Also, as detailed in chapter 1, evolving con- 
ceptualizations of the concept of validity no 
longer speak of different types of validity but 
speak instead of different lines of validity evi- 
dence, all in service of providing information 
relevant to a specific intended interpretation 
of test scores. Thus, many lines of evidence 
can contribute to an understanding of the 
construct meaning of test scores. 

Organization of This Volume 

Part I of the Standards, “Test Construction,
Evaluation, and Documentation,” contains 
standards for validity (ch. 1); reliability and 
errors of measurement (ch. 2); test develop- 
ment and revision (ch. 3); scaling, norming, 
and score comparability (ch. 4); test adminis- 
tration, scoring, and reporting (ch. 5); and 
supporting documentation for tests (ch. 6). 
Part II addresses “Fairness in Testing,” and
contains standards on fairness and bias (ch. 7); 
the rights and responsibilities of test takers 
(ch. 8); testing individuals of diverse linguis- 
tic backgrounds (ch. 9); and testing individu- 
als with disabilities (ch. 10). Part III treats 
specific “Testing Applications,” and contains 
standards involving general responsibilities of 
test users (ch. 11); psychological testing and
assessment (ch. 12); educational testing and 
assessment (ch. 13); testing in employment 
and credentialing (ch. 14); and testing in
program evaluation and public policy (ch. 15).

Each chapter begins with introductory 
text that provides background for the stan- 
dards that follow. This revision of the 
Standards contains more extensive intro- 
ductory text material than its predecessor. 
Recognizing the common use of the Standards 
in the education of future test developers 
and users, the committee opted to provide a 
context for the standards themselves by presenting
more background material than in
previous versions. This text is designed to 
assist in the interpretation of the standards 
that follow in each chapter. Although the text 
is at times prescriptive and exhortatory, it 
should not be interpreted as imposing addi- 
tional standards. 

The Standards also contains an index and 
includes a glossary that provides definitions 
for terms as they are specifically used in this 
volume. 

Background

Validity refers to the degree to which evidence 
and theory support the interpretations of test 
scores entailed by proposed uses of tests. 
Validity is, therefore, the most fundamental 
consideration in developing and evaluating 
tests. The process of validation involves accu- 
mulating evidence to provide a sound scientific 
basis for the proposed score interpretations. 
It is the interpretations of test scores required
by proposed uses that are evaluated, not the 
test itself. When test scores are used or inter- 
preted in more than one way, each intended 
interpretation must be validated. 

Validation logically begins with an explicit 
statement of the proposed interpretation of 
test scores, along with a rationale for the rele- 
vance of the interpretation to the proposed 
use. The proposed interpretation refers to the 
construct or concepts the test is intended to 
measure. Examples of constructs are mathe- 
matics achievement, performance as a com- 
puter technician, depression, and self-esteem. 
To support test development, the proposed 
interpretation is elaborated by describing 
its scope and extent and by delineating the 
aspects of the construct that are to be repre- 
sented. The detailed description provides a 
conceptual framework for the test, delineat- 
ing the knowledge, skills, abilities, processes, 
or characteristics to be assessed. The frame- 
work indicates how this representation of 
the construct is to be distinguished from 
other constructs and how it should relate 
to other variables. 

The conceptual framework is partially 
shaped by the ways in which test scores will 
be used. For instance, a test of mathematics 
achievement might be used to place a student 
in an appropriate program of instruction, to 
endorse a high school diploma, or to inform 
a college admissions decision. Each of these 
uses implies a somewhat different interpretation
of the mathematics achievement test
scores: that a student will benefit from a
particular instructional intervention, that a 
student has mastered a specified curriculum, 
or that a student is likely to be successful 
with college-level work. Similarly, a test of 
self-esteem might be used for psychological 
counseling, to inform a decision about 
employment, or for the basic scientific pur- 
pose of elaborating the construct of self-esteem. 
Each of these potential uses shapes the specified 
framework and the proposed interpretation of 
the test’s scores and also has implications for 
test development and evaluation. 

Validation can be viewed as developing a 
scientifically sound validity argument to sup- 
port the intended interpretation of test scores 
and their relevance to the proposed use. The 
conceptual framework points to the kinds of 
evidence that might be collected to evaluate 
the proposed interpretation in light of the 
purposes of testing. As validation proceeds, 
and new evidence about the meaning of a 
test’s scores becomes available, revisions may 
be needed in the test, in the conceptual 
framework that shapes it, and even in the 
construct underlying the test. 

The wide variety of tests and circum- 
stances makes it natural that some types of 
evidence will be especially critical in a given 
case, whereas other types will be less useful. 
The decision about what types of evidence 
are important for validation in each instance 
can be clarified by developing a set of propo- 
sitions that support the proposed interpretation 
for the particular purpose of testing. For 
instance, when a mathematics achievement 
test is used to assess readiness for an advanced 
course, evidence for the following proposi- 
tions might be deemed necessary: (a) that cer- 
tain skills are prerequisite for the advanced 
course; (b) that the content domain of the 
test is consistent with these prerequisite skills; 
(c) that test scores can be generalized across 
relevant sets of items; (d) that test scores are 
not unduly influenced by ancillary variables, 
such as writing ability; (e) that success in the
advanced course can be validly assessed; and 
(f) that examinees with high scores on the
test will be more successful in the advanced 
course than examinees with low scores on the 
test. Examples of propositions in other testing 
contexts might include, for instance, the 
proposition that examinees with high general 
anxiety scores experience significant anxiety 
in a range of settings, the proposition that a 
child’s score on an intelligence scale is strong- 
ly related to the child’s academic performance, 
or the proposition that a certain pattern of 
scores on a neuropsychological battery indi- 
cates impairment characteristic of brain injury. 
The validation process evolves as these propo- 
sitions are articulated and evidence is gathered 
to evaluate their soundness. 

Identifying the propositions implied by 
a proposed test interpretation can be facili- 
tated by considering rival hypotheses that 
may challenge the proposed interpretation. 
It is also useful to consider the perspectives 
of different interested parties, existing expe- 
rience with similar tests and contexts, and 
the expected consequences of the proposed 
test use. Plausible rival hypotheses can often 
be generated by considering whether a test 
measures less or more than its proposed 
construct. Such concerns are referred to as 
construct underrepresentation and construct- 
irrelevant variance. 

Construct underrepresentation refers to 
the degree to which a test fails to capture 
important aspects of the construct. It implies 
a narrowed meaning of test scores because 
the test does not adequately sample some 
types of content, engage some psychological 
processes, or elicit some ways of responding 
that are encompassed by the intended construct.
Take, for example, a test of reading
comprehension intended to measure chil- 
dren’s ability to read and interpret stories 
with understanding. A particular test might 
underrepresent the intended construct because 
it did not contain a sufficient variety of reading
passages or ignored a common type of
reading material. As another example, a test 
of anxiety might measure only physiological 
reactions and not emotional, cognitive, or 
situational components. 

Construct-irrelevant variance refers to 
the degree to which test scores are affected by 
processes that are extraneous to its intended 
construct. The test scores may be systemati- 
cally influenced to some extent by compo- 
nents that are not part of the construct. In 
the case of a reading comprehension test, 
construct-irrelevant components might 
include an emotional reaction to the test 
content, familiarity with the subject matter 
of the reading passages on the test, or the 
writing skill needed to compose a response. 
Depending on the detailed definition of the 
construct, vocabulary knowledge or reading 
speed might also be irrelevant components. 
On a test of anxiety, a response bias to under- 
report anxiety might be considered a source 
of construct-irrelevant variance. 

Nearly all tests leave out elements that 
some potential users believe should be meas- 
ured and include some elements that some 
potential users consider inappropriate. 
Validation involves careful attention to possible 
distortions in meaning arising from inadequate 
representation of the construct and also to 
aspects of measurement such as test format, 
administration conditions, or language level 
that may materially limit or qualify the inter- 
pretation of test scores. That is, the process 
of validation may lead to revisions in the test, 
the conceptual framework of the test, or both. 
The revised test would then need validation. 

When propositions have been identified 
that would support the proposed interpretation 
of test scores, validation can proceed by devel- 
oping empirical evidence, examining relevant 
literature, and/or conducting logical analyses to 
evaluate each of these propositions. Empirical 
evidence may include both local evidence, 
produced within the contexts where the test 
will be used, and evidence from similar testing 
applications in other settings. Use of existing
evidence from similar tests and contexts can 
enhance the quality of the validity argument, 
especially when current data are limited. 

Because a validity argument typically 
depends on more than one proposition, strong 
evidence in support of one in no way dimin- 
ishes the need for evidence to support others. 
For example, a strong predictor-criterion rela- 
tionship in an employment setting is not suf- 
ficient to justify test use for selection without 
considering the appropriateness and meaning- 
fulness of the criterion measure. Professional 
judgment guides decisions regarding the spe- 
cific forms of evidence that can best support 
the intended interpretation and use. As in 
all scientific endeavors, the quality of the 
evidence is primary. A few lines of solid evi- 
dence regarding a particular proposition are 
better than numerous lines of evidence of 
questionable quality. 

Validation is the joint responsibility of 
the test developer and the test user. The test 
developer is responsible for furnishing rele- 
vant evidence and a rationale in support of 
the intended test use. The test user is ultimately 
responsible for evaluating the evidence in the 
particular setting in which the test is to be 
used. When the use of a test differs from that 
supported by the test developer, the test user 
bears special responsibility for validation. The 
standards apply to the validation process, for 
which the appropriate parties share responsibility.
It should be noted that important contributions
to the validity evidence are made as
other researchers report findings of investiga- 
tions that are related to the meaning of scores 
on the test. 

Sources of Validity Evidence 

The following sections outline various sources 
of evidence that might be used in evaluating a 
proposed interpretation of test scores for par- 
ticular purposes. These sources of evidence 
may illuminate different aspects of validity,
but they do not represent distinct types of
validity. Validity is a unitary concept. It is the 
degree to which all the accumulated evidence 
supports the intended interpretation of test 
scores for the proposed purpose. Like the 
1985 Standards, this edition refers to types of 
validity evidence, rather than distinct types of 
validity. To emphasize this distinction, the 
treatment that follows does not follow tradi- 
tional nomenclature (i.e., the use of the terms 
content validity or predictive validity). The 
glossary contains definitions of the traditional 
terms, explicating the difference between tra- 
ditional and current use. 

Evidence Based on Test Content 

Important validity evidence can be obtained 
from an analysis of the relationship between a 
test’s content and the construct it is intended 
to measure. Test content refers to the themes, 
wording, and format of the items, tasks, or 
questions on a test, as well as the guidelines for 
procedures regarding administration and scor- 
ing. Test developers often work from a specifi- 
cation of the content domain. The content 
specification carefully describes the content in 
detail, often with a classification of areas of 
content and types of items. Evidence based on 
test content can include logical or empirical 
analyses of the adequacy with which the test 
content represents the content domain and of 
the relevance of the content domain to the 
proposed interpretation of test scores. Evidence 
based on content can also come from expert
judgments of the relationship between parts 
of the test and the construct. For example, in 
developing a licensure test, the major facets of 
the specific occupation can be specified, and 
experts in that occupation can be asked to 
assign test items to the categories defined by 
those facets. They, or other qualified experts, 
can then judge the representativeness of the 
chosen set of items. Sometimes rules or algo- 
rithms can be constructed to select or generate 
items that differ systematically on the various 
facets of content, according to specifications. 


Some tests are based on systematic obser- 
vations of behavior. For example, a listing of 
the tasks comprising a job domain may be 
developed from observations of behavior in a 
job, together with judgments of subject-matter 
experts. Expert judgments can be used to assess 
the relative importance, criticality, and/or fre- 
quency of the various tasks. A job sample test 
can then be constructed from a random or 
stratified sampling of tasks rated highly on 
these characteristics. The test can then be 
administered under standardized conditions 
in an off-the-job setting.

The appropriateness of a given content 
domain is related to the specific inferences to 
be made from test scores. Thus, when consid- 
ering an available test for a purpose other than 
that for which it was first developed, it is 
especially important to evaluate the appropri- 
ateness of the original content domain for the 
proposed new use. In educational program 
evaluations, for example, tests may properly 
cover material that receives little or no atten- 
tion in the curriculum, as well as that toward 
which instruction is directed. Policymakers 
can then evaluate student achievement with 
respect to both content neglected and content 
addressed. On the other hand, when student 
mastery of a delivered curriculum is tested for 
purposes of informing decisions about indi- 
vidual students, such as promotion or gradua- 
tion, the framework elaborating a content 
domain is appropriately limited to what stu- 
dents have had an opportunity to learn from 
the curriculum as delivered. 

Evidence about content can be used, in 
part, to address questions about differences in 
the meaning or interpretation of test scores 
across relevant subgroups of examinees. Of 
particular concern is the extent to which con- 
struct underrepresentation or construct-irrele- 
vant components may give an unfair advantage 
or disadvantage to one or more subgroups of 
examinees. Careful review of the construct 
and test content domain by a diverse panel 
of experts may point to potential sources of
irrelevant difficulty (or easiness) that require
further investigation. 

Evidence Based on Response Processes 

Theoretical and empirical analyses of the 
response processes of test takers can provide 
evidence concerning the fit between the con- 
struct and the detailed nature of performance 
or response actually engaged in by examinees. 
For instance, if a test is intended to assess 
mathematical reasoning, it becomes impor- 
tant to determine whether examinees are, in 
fact, reasoning about the material given instead 
of following a standard algorithm. For another 
instance, scores on a scale intended to assess 
the degree of an individual’s extroversion or 
introversion should not be strongly influenced 
by social conformity. 

Evidence based on response processes 
generally comes from analyses of individual 
responses. Questioning test takers about their
performance strategies or responses to partic- 
ular items can yield evidence that enriches the 
definition of a construct. Maintaining records 
that monitor the development of a response 
to a writing task, through successive written 
drafts or electronically monitored revisions, 
for instance, also provides evidence of process. 
Documentation of other aspects of performance, 
like eye movements or response times, may 
also be relevant to some constructs. Inferences 
about processes involved in performance can 
also be developed by analyzing the relationship 
among parts of the test and between the test 
and other variables. Wide individual differ- 
ences in process can be revealing and may lead 
to reconsideration of certain test formats. 

Evidence of response processes can 
contribute to questions about differences in 
meaning or interpretation of test scores across 
relevant subgroups of examinees. Process stud- 
ies involving examinees from different sub- 
groups can assist in determining the extent to 
which capabilities irrelevant or ancillary to the 
construct may be differentially influencing 
their performance. 

Studies of response processes are not limited
to the examinee. Assessments often rely
on observers or judges to record and/or evalu- 
ate examinees’ performances or products. In 
such cases, relevant validity evidence includes 
the extent to which the processes of observers 
or judges are consistent with the intended 
interpretation of scores. For instance, if 
judges are expected to apply particular criteria 
in scoring examinees’ performances, it is 
important to ascertain whether they are, in 
fact, applying the appropriate criteria and not 
being influenced by factors that are irrelevant 
to the intended interpretation. Thus, valida- 
tion may include empirical studies of how 
observers or judges record and evaluate data 
along with analyses of the appropriateness of 
these processes to the intended interpretation 
or construct definition. 

Evidence Based on Internal Structure 

Analyses of the internal structure of a 
test can indicate the degree to which the 
relationships among test items and test com- 
ponents conform to the construct on which 
the proposed test score interpretations are 
based. The conceptual framework for a test 
may imply a single dimension of behavior, 
or it may posit several components that are 
each expected to be homogeneous, but that 
are also distinct from each other. For exam- 
ple, a measure of discomfort on a health sur- 
vey might assess both physical and emotional 
health. The extent to which item interrela- 
tionships bear out the presumptions of the 
framework would be relevant to validity. 

The specific types of analysis and their 
interpretation depend on how the test will 
be used. For example, if a particular appli- 
cation posited a series of test components of 
increasing difficulty, empirical evidence of 
the extent to which response patterns con- 
formed to this expectation would be provid- 
ed. A theory that posited unidimensionality 
would call for evidence of item homogeneity.
In this case, the item interrelationships
also provide an estimate of score reliability,
but such an index would be inappropriate for 
tests with a more complex internal structure. 
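
The link between item interrelationships and score reliability can be illustrated with a minimal Python sketch. It computes coefficient alpha, one commonly used internal-consistency index; the choice of that index, the function name, and the data layout are assumptions made for illustration and are not prescribed by the text.

```python
from statistics import pvariance


def coefficient_alpha(scores):
    """Coefficient alpha for a set of item scores.

    scores: list of examinee rows, each row a list of item scores
    (hypothetical layout chosen for this sketch).
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("at least two items are required")
    # Variance of each item across examinees.
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of the total (summed) score across examinees.
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)
```

As the paragraph above cautions, an index of this kind presumes a single dimension, so a low value for a deliberately multidimensional test would not by itself indicate a defect.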

Some studies of the internal structure of 
tests are designed to show whether particular 
items may function differently for identifiable 
subgroups of examinees. Differential item 
functioning occurs when different groups 
of examinees with similar overall ability, or 
similar status on an appropriate criterion, 
have, on average, systematically different 
responses to a particular item. This issue is 
discussed in chapters 3 and 7. However, dif- 
ferential item functioning is not always a 
flaw or weakness. Subsets of items that have 
a specific characteristic in common (e.g., 
specific content, task representation) may 
function differently for different groups of 
similarly scoring examinees. This indicates 
a kind of multidimensionality that may be 
unexpected or may conform to the test 
framework. 
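
One way to examine whether matched examinees from different groups respond differently to an item is sketched below in Python. It uses the Mantel-Haenszel common odds ratio, a widely used procedure that the text does not itself name; the group labels "ref" and "focal", the function name, and the record layout are hypothetical.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one item across matched score levels.

    records: iterable of (group, matching_score, item_correct), where
    group is "ref" or "focal" and item_correct is 0 or 1.
    A value near 1.0 suggests similar item performance for examinees
    matched on the overall score.
    """
    # One 2x2 table per matching-score level:
    # [ref correct, ref incorrect, focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in records:
        if group not in ("ref", "focal"):
            raise ValueError("group must be 'ref' or 'focal'")
        cell = (0 if group == "ref" else 2) + (0 if correct else 1)
        tables[score][cell] += 1

    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these data")
    return numerator / denominator
```

A ratio far from 1.0 flags an item for review; as noted above, it does not by itself establish that the item is flawed.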

Evidence Based on Relations to Other Variables 

Analyses of the relationship of test scores 
to variables external to the test provide anoth- 
er important source of validity evidence. 
External variables may include measures of 
some criteria that the test is expected to pre- 
dict, as well as relationships to other tests 
hypothesized to measure the same constructs, 
and tests measuring related or different con- 
structs. Measures other than test scores, such 
as performance criteria, are often used in 
employment settings. Categorical variables, 
including group membership variables, 
become relevant when the theory underlying 
a proposed test use suggests that group differ- 
ences should be present or absent if a pro- 
posed test interpretation is to be supported. 
Evidence based on relationships with other 
variables addresses questions about the degree 
to which these relationships are consistent 
with the construct underlying the proposed 
test interpretations. 

Convergent and discriminant evidence.

Relationships between test scores and other 
measures intended to assess similar constructs 
provide convergent evidence, whereas relationships
between test scores and measures
purportedly of different constructs provide 
discriminant evidence. For instance, within 
some theoretical frameworks, scores on a 
multiple-choice test of reading comprehen- 
sion might be expected to relate closely 
(convergent evidence) to other measures of 
reading comprehension based on other meth- 
ods, such as essay responses; conversely, test 
scores might be expected to relate less closely 
(discriminant evidence) to measures of other 
skills, such as logical reasoning. Relationships 
among different methods of measuring the 
construct can be especially helpful in sharp- 
ening and elaborating score meaning and 
interpretation. 
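
The reading comprehension example can be made concrete with a short Python sketch that compares two correlations. The scores are invented for illustration, the variable names are hypothetical, and the Pearson coefficient is only one of several ways such relationships might be summarized.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented scores for six examinees (not real data).
multiple_choice_reading = [12, 15, 9, 20, 17, 11]
essay_reading = [3, 4, 2, 5, 4, 3]
logical_reasoning = [7, 5, 8, 6, 9, 4]

# Same construct, different method: expected to relate closely.
convergent = pearson_r(multiple_choice_reading, essay_reading)
# Different construct: expected to relate less closely.
discriminant = pearson_r(multiple_choice_reading, logical_reasoning)
```

Under the theoretical framework described above, the pattern of interest is a convergent coefficient that is clearly larger than the discriminant one.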

Evidence of relations with other variables 
can involve experimental as well as correla- 
tional evidence. Studies might be designed, 
for instance, to investigate whether scores on 
a measure of anxiety improve as a result of 
some psychological treatment or whether 
scores on a test of academic achievement dif- 
ferentiate between instructed and nonin- 
structed groups. If performance increases due 
to short-term coaching are viewed as a threat 
to validity, it would be useful to investigate 
whether coached and uncoached groups per- 
form differently. 

Test-criterion relationships. Evidence of 
the relation of test scores to a relevant criterion 
may be expressed in various ways, but the 
fundamental question is always: How accu- 
rately do test scores predict criterion per- 
formance? The degree of accuracy deemed 
necessary depends on the purpose for which 
the test is used. 

The criterion variable is a measure of some 
attribute or outcome that is of primary inter- 
est, as determined by test users, who may be 
administrators in a school system, the management
of a firm, or clients. The choice of
the criterion and the measurement procedures
used to obtain criterion scores are of central 
importance. The value of a test-criterion study
depends on the relevance, reliability, and validity 
of the interpretation based on the criterion 
measure for a given testing application. 

Historically, two designs, often called 
predictive and concurrent, have been distinguished
for evaluating test-criterion relationships.
A predictive study indicates how
accurately test data can predict criterion scores 
that are obtained at a later time. A concurrent 
study obtains predictor and criterion infor- 
mation at about the same time. When predic- 
tion is actually contemplated, as in education 
or employment settings, or in planning reha- 
bilitation regimens, predictive studies can 
retain the temporal differences and other
characteristics of the practical situation. 
Concurrent evidence, which avoids temporal 
changes, is particularly useful for psychodiag- 
nostic tests or to investigate alternative meas- 
ures of some specified construct. In general, 
the choice of research strategy is guided by 
prior evidence of the extent to which predic- 
tive and concurrent studies yield the same or 
different results in the domain. 

Test scores are sometimes used in allocat- 
ing individuals to different treatments, such as 
different jobs within an institution, in a way 
that is advantageous for the institution and for 
the individuals. In that context, evidence is 
needed to judge the suitability of using a test 
when classifying or assigning a person to one 
job versus another or to one treatment versus 
another. Classification decisions are supported 
by evidence that the relationship of test scores 
to performance criteria is different for different 
treatments. It is possible for tests to be highly 
predictive of performance for different educa- 
tion programs or jobs without providing the 
information necessary to make a comparative 
judgment of the efficacy of assignments or 
treatments. In general, decision rules for 
selection or placement are also influenced by 
the number of persons to be accepted or the 
numbers that can be accommodated in alternative
placement categories.

Evidence about relations to other vari- 
ables is also used to investigate questions of 
differential prediction for groups. For instance, 
a finding that the relation of test scores to a 
relevant criterion variable differs from one 
group to another may imply that the mean- 
ing of the scores is not the same for members 
of the different groups, perhaps due to con- 
struct underrepresentation or construct-irrele- 
vant components. However, the difference 
may also imply that the criterion has different 
meaning for different groups. The differences 
in test-criterion relationships can also arise 
from measurement error, especially when 
group means differ, so such differences do 
not necessarily indicate differences in score 
meaning. (See chapter 7.) 
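
A first descriptive look at differential prediction might fit a separate least-squares line for each group and compare slopes and intercepts, as in the Python sketch below. The function names and the input layout are hypothetical, and ordinary least squares is an assumption of this sketch; as the paragraph above notes, differences found this way can also reflect measurement error or a criterion that means different things across groups.

```python
def fit_line(test_scores, criterion_scores):
    """Least-squares slope and intercept for predicting the criterion."""
    n = len(test_scores)
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in test_scores)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(test_scores, criterion_scores))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def group_regressions(data):
    """data: dict mapping a group label to (test_scores, criterion_scores).

    Returns a dict mapping each group label to (slope, intercept).
    """
    return {group: fit_line(xs, ys) for group, (xs, ys) in data.items()}
```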

Validity generalization. An important 
issue in educational and employment settings 
is the degree to which evidence of validity 
based on test-criterion relations can be gener- 
alized to a new situation without further study 
of validity in that new situation. When a test 
is used to predict the same or similar criteria 
(e.g., performance of a given job) at different 
times or in different places, it is typically found 
that observed test-criterion correlations vary 
substantially. In the past, this has been taken 
to imply that local validation studies are always 
required. More recently, meta-analytic analyses 
have shown that in some domains, much of 
this variability may be due to statistical artifacts 
such as sampling fluctuations and variations 
across validation studies in the ranges of test 
scores and in the reliability of criterion meas- 
ures. When these and other influences are taken 
into account, it may be found that the remain- 
ing variability in validity coefficients is relatively 
small. Thus, statistical summaries of past vali- 
dation studies in similar situations may be 
useful in estimating test-criterion relationships 
in a new situation. This practice is referred to 
as the study of validity generalization. 
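
The role of sampling fluctuation can be sketched in Python with a "bare-bones" summary of prior studies. This is a simplified illustration under stated assumptions: it adjusts only for sampling error, not for differences in score range or criterion reliability, and the function name, input layout, and the particular sampling-variance approximation are choices made for this sketch rather than anything specified in the text.

```python
def bare_bones_summary(studies):
    """Summarize observed test-criterion correlations across studies.

    studies: list of (sample_size, observed_correlation) pairs.
    Returns the sample-size-weighted mean correlation, the weighted
    observed variance, the variance expected from sampling error alone,
    and the residual variance left after removing sampling error.
    """
    total_n = sum(n for n, _ in studies)
    mean_r = sum(n * r for n, r in studies) / total_n
    observed_var = sum(n * (r - mean_r) ** 2 for n, r in studies) / total_n
    # Approximate sampling-error variance at the average sample size.
    mean_n = total_n / len(studies)
    sampling_var = (1 - mean_r ** 2) ** 2 / (mean_n - 1)
    residual_var = max(0.0, observed_var - sampling_var)
    return {
        "mean_r": mean_r,
        "observed_var": observed_var,
        "sampling_var": sampling_var,
        "residual_var": residual_var,
    }
```

When the residual variance is close to zero, most of the observed spread in correlations is consistent with sampling fluctuation; a large residual suggests that situational differences may matter.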


In some circumstances, there is a strong 
basis for using validity generalization. This 
would be the case where the meta-analytic 
database is large, where the meta-analytic data 
adequately represent the type of situation to 
which one wishes to generalize, and where 
correction for statistical artifacts produces a 
clear and consistent pattern of validity evi- 
dence. In such circumstances, the informa- 
tional value of a local validity study may be 
relatively limited. In other circumstances, the 
inferential leap required for generalization 
may be much larger. The meta-analytic data- 
base may be small, the findings may be less 
consistent, or the new situation may involve 
features markedly different from those repre- 
sented in the meta-analytic database. In such 
circumstances, situation-specific evidence of 
validity will be relatively more informative. 
Although research on validity generalization 
shows that results of a single local validation 
study may be quite imprecise, there are situa- 
tions where a single study, carefully done, 
with adequate sample size, provides sufficient 
evidence to support test use in a new situa- 
tion. This highlights the importance of exam- 
ining carefully the comparative informational 
value of local versus meta-analytic studies. 

In conducting studies of the generaliz- 
ability of validity evidence, the prior studies 
that are included may vary according to sev- 
eral situational facets. Some of the major 
facets are (a) differences in the way the pre- 
dictor construct is measured, (b) the type of 
job or curriculum involved, (c) the type of 
criterion measure used, (d) the type of test 
takers, and (e) the time period in which the 
study was conducted. In any particular study 
of validity generalization, any number of these 
facets might vary, and a major objective of the 
study is to determine empirically the extent 
to which variation in these facets affects the 
test-criterion correlations obtained. 

The extent to which predictive or con- 
current evidence of validity generalization can 
be used in new situations is in large measure 
a function of accumulated research. Although 
evidence of generalization can often help to 
support a claim of validity in a new situation, 
the extent of available data limits the extent to 
which the claim can be sustained. 

The above discussion focuses on the use 
of cumulative databases to estimate predictor- 
criterion relationships. Meta-analytic tech- 
niques can also be used to summarize other 
forms of data relevant to other inferences one 
may wish to draw from test scores in a partic- 
ular application, such as effects of coaching 
and effects of certain alterations in testing 
conditions to accommodate test takers with 
certain disabilities. 

Evidence Based on Consequences of Testing 

An issue receiving attention in recent 
years is the incorporation of the intended and 
unintended consequences of test use into the 
concept of validity. Evidence about conse- 
quences can inform validity decisions. Here, 
however, it is important to distinguish 
between evidence that is directly relevant to 
validity and evidence that may inform deci- 
sions about social policy but falls outside 
the realm of validity. 

Distinguishing between issues of validity 
and issues of social policy becomes particularly 
important in cases where differential conse- 
quences of test use are observed for different 
identifiable groups. For example, concerns 
have been raised about the effect of group 
differences in test scores on employment 
selection and promotion, the placement of 
children in special education classes, and the 
narrowing of a school’s curriculum to exclude 
learning of objectives that are not assessed. 
Although information about the consequences 
of testing may influence decisions about test 
use, such consequences do not in and of 
themselves detract from the validity of intended 
test interpretations. Rather, judgments of 
validity or invalidity in the light of testing 
consequences depend on a more searching 
inquiry into the sources of those consequences. 

Take, as an example, a finding of different 
hiring rates for members of different groups as 
a consequence of using an employment test. If 
the difference is due solely to an unequal distri- 
bution of the skills the test purports to meas- 
ure, and if those skills are, in fact, important 
contributors to job performance, then the find- 
ing of group differences per se does not imply 
any lack of validity for the intended inference. 
If, however, the test measured skill differences 
unrelated to job performance (e.g., a sophisti- 
cated reading test for a job that required only 
minimal functional literacy), or if the differ- 
ences were due to the test’s sensitivity to some 
examinee characteristic not intended to be part 
of the test construct, then validity would be 
called into question, even if test scores correlat- 
ed positively with some measure of job per- 
formance. Thus, evidence about consequences 
may be directly relevant to validity when it can 
be traced to a source of invalidity such as con- 
struct underrepresentation or construct-irrele- 
vant components. Evidence about consequences 
that cannot be so traced — that in fact reflects 
valid differences in performance — is crucial in 
informing policy decisions but falls outside the 
technical purview of validity. 

Tests are commonly administered in the 
expectation that some benefit will be realized 
from the intended use of the scores. A few of 
the many possible benefits are selection of 
efficacious treatments for therapy, placement 
of workers in suitable jobs, prevention of 
unqualified individuals from entering a pro- 
fession, or improvement of classroom instruc- 
tional practices. A fundamental purpose of 
validation is to indicate whether these specific 
benefits are likely to be realized. Thus, in the 
case of a test used in placement decisions, the 
validation would be informed by evidence 
that alternative placements, in fact, are dif- 
ferentially beneficial to the persons and the 
institution. In the case of employment testing, 
if a test publisher claims that use of the test 
will result in reduced employee training costs, 
improved workforce efficiency, or some other 
benefit, then the validation would be informed 
by evidence in support of that claim. 

Claims are sometimes made for benefits 
of testing that go beyond direct uses of the 
test scores themselves. Educational tests, for 
example, may be advocated on the grounds 
that their use will improve student motiva- 
tion or encourage changes in classroom 
instructional practices by holding educators 
accountable for valued learning outcomes. 
Where such claims are central to the rationale 
advanced for testing, the direct examination 
of testing consequences necessarily assumes 
even greater importance. The validation 
process in such cases would be informed by 
evidence that the anticipated benefits of test- 
ing are being realized. 

Integrating the Validity Evidence 

A sound validity argument integrates various 
strands of evidence into a coherent account 
of the degree to which existing evidence and 
theory support the intended interpretation of 
test scores for specific uses. It encompasses 
evidence gathered from new studies and evi- 
dence available from earlier reported research. 
The validity argument may indicate the need 
for refining the definition of the construct, may 
suggest revisions in the test or other aspects 
of the testing process, and may indicate areas 
needing further study. 

Ultimately, the validity of an intended 
interpretation of test scores relies on all the 
available evidence relevant to the technical 
quality of a testing system. This includes evi- 
dence of careful test construction; adequate 
score reliability; appropriate test administration 
and scoring; accurate score scaling, equating, 
and standard setting; and careful attention to 
fairness for all examinees, as described in subse- 
quent chapters of the Standards. 


Standard 1.1 

A rationale should be presented for each rec- 
ommended interpretation and use of test 
scores, together with a comprehensive sum- 
mary of the evidence and theory bearing on 
the intended use or interpretation. 

Comment: The rationale should indicate what 
propositions are necessary to investigate the 
intended interpretation. The comprehensive 
summary should combine logical analysis 
with empirical evidence to provide support 
for the test rationale. Evidence may come 
from studies conducted locally, in the setting 
where the test is to be used; from specific 
prior studies; or from comprehensive statisti- 
cal syntheses of available studies meeting 
clearly specified criteria. No type of evidence 
is inherently preferable to others; rather, the 
quality and relevance of the evidence to the 
intended test use determine the value of a 
particular kind of evidence. A presentation of 
empirical evidence on any point should give 
due weight to all relevant findings in the scientific 
literature, including those inconsistent 
with the intended interpretation or use. Test 
developers have the responsibility to provide 
support for their own recommendations, but 
test users are responsible for evaluating the 
quality of the validity evidence provided and 
its relevance to the local situation. 

Standard 1.2 

The test developer should set forth clearly 
how test scores are intended to be interpret- 
ed and used. The population(s) for which a 
test is appropriate should be clearly delimit- 
ed, and the construct that the test is intend- 
ed to assess should be clearly described. 

Comment: Statements about validity should 
refer to particular interpretations and uses. It 
is incorrect to use the unqualified phrase “the 
validity of the test.” No test is valid for all 
purposes or in all situations. Each recommended 
use or interpretation requires validation 
and should specify in clear language the 
population for which the test is intended, the 
construct it is intended to measure, and the 
manner and contexts in which test scores are 
to be employed. 

Standard 1.3 

If validity for some common or likely inter- 
pretation has not been investigated, or if the 
interpretation is inconsistent with available 
evidence, that fact should be made clear and 
potential users should be cautioned about 
making unsupported interpretations. 

Comment: If past experience suggests that a 
test is likely to be used inappropriately for 
certain kinds of decisions, specific warnings 
against such uses should be given. On the 
other hand, no two situations are ever identi- 
cal, so some generalization by the user is 
always necessary. Professional judgment is 
required to evaluate the extent to which exist- 
ing validity evidence supports a given test use. 

Standard 1.4 

If a test is used in a way that has not been 
validated, it is incumbent on the user to jus- 
tify the new use, collecting new evidence if 
necessary. 

Comment: Professional judgment is required to 
evaluate the extent to which existing validity 
evidence applies in the new situation and to 
determine what new evidence may be needed. 
The amount and kinds of new evidence 
required may be influenced by experience with 
similar prior test uses or interpretations and 
by the amount, quality, and relevance of 
existing data. 

Standard 1.5 

The composition of any sample of examinees 
from which validity evidence is 
obtained should be described in as much 
detail as is practical, including major rele- 
vant sociodemographic and developmental 
characteristics. 

Comment: Statistical findings can be influ- 
enced by factors affecting the sample on 
which the results are based. When the sample 
is intended to represent a population, that 
population should be described, and atten- 
tion should be drawn to any systematic fac- 
tors that may limit the representativeness of 
the sample. Factors that might reasonably be 
expected to affect the results include self-selection, 
attrition, linguistic prowess, disability status, 
exclusion criteria, and others. 

If the subjects of a validity study are patients, 
for example, then the diagnoses of the 
patients are important, as well as other char- 
acteristics, such as the severity of the diag- 
nosed condition. For tests used in industry, 
the employment status (e.g., applicants versus 
current job holders), the general level of expe- 
rience and educational background and the 
gender and ethnic composition of the sample 
may be relevant information. For tests used 
in educational settings, relevant information 
may include educational background, devel- 
opmental level, community characteristics, or 
school admissions policies, as well as the gen- 
der and ethnic composition of the sample. 
Sometimes restrictions about privacy preclude 
obtaining such population information. 

Standard 1.6 

When the validation rests in part on the 
appropriateness of test content, the procedures 
followed in specifying and generating test con- 
tent should be described and justified in refer- 
ence to the construct the test is intended to 
measure or the domain it is intended to repre- 
sent. If the definition of the content sampled 
incorporates criteria such as importance, fre- 
quency, or criticality, these criteria should also 
be clearly explained and justified. 

Comment: For example, test developers might 
provide a logical structure that maps the 
items on the test to the content domain, 
illustrating the relevance of each item and the 
adequacy with which the set of items repre- 
sents the content domain. Areas of the content 
domain that are not included among the test 
items could be indicated as well. 

Standard 1.7 

When a validation rests in part on the opin- 
ions or decisions of expert judges, observers, 
or raters, procedures for selecting such 
experts and for eliciting judgments or ratings 
should be fully described. The qualifications 
and experience of the judges should 
be presented. The description of procedures 
should include any training and instructions 
provided, should indicate whether participants 
reached their decisions independently, 
and should report the level of agreement 
reached. If participants interacted with one 
another or exchanged information, the pro- 
cedures through which they may have influ- 
enced one another should be set forth. 

Comment: Systematic collection of judgments 
or opinions may occur at many points in test 
construction (e.g., in eliciting expert judg- 
ments of content appropriateness or adequate 
content representation), in formulating rules 
or standards for score interpretation (e.g., in 
setting cut scores), or in test scoring (e.g., rat- 
ing of essay responses). Whenever such proce- 
dures are employed, the quality of the resulting 
judgments is important to the validation. It 
may be entirely appropriate to have experts 
work together to reach consensus, but it would 
not then be appropriate to treat their respective 
judgments as statistically independent. 
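
One way to report a level of agreement for independently made categorical judgments is a chance-corrected index such as Cohen's kappa. The sketch below is illustrative only: the Standards do not prescribe a particular index, the function name and example ratings are hypothetical, and the calculation assumes two raters who judged the same items without conferring.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Proportion of items on which the two raters gave the same category.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's category frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one category")
    return (observed - expected) / (1 - expected)


# Illustrative (made-up) pass/fail judgments on four essay responses.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

If the judges reached consensus through discussion, an index of this kind would overstate the independent evidence, which is the caution raised in the comment above.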

Standard 1.8 

If the rationale for a test use or score inter- 
pretation depends on premises about the 
psychological processes or cognitive operations 
used by examinees, then theoretical or 
empirical evidence in support of those prem- 
ises should be provided. When statements 
about the processes employed by observers 
or scorers are part of the argument for validity, 
similar information should be provided. 

Comment: If the test specification delineates 
the processes to be assessed, then evidence is 
needed that the test items do, in fact, tap the 
intended processes. 

Standard 1.9 

If a test is claimed to be essentially unaffect- 
ed by practice and coaching, then the sensi- 
tivity of test performance to change with 
these forms of instruction should be docu- 
mented. 

Comment: Materials to aid in score interpreta- 
tion should summarize evidence indicating 
the degree to which improvement with practice 
or coaching can be expected. Also, materials 
written for test takers should provide 
practical guidance about the value of test 
preparation activities, including coaching. 

Standard 1.10 

When interpretation of performance on spe- 
cific items, or small subsets of items, is sug- 
gested, the rationale and relevant evidence in 
support of such interpretation should be 
provided. When interpretation of individual 
item responses is likely but is not recom- 
mended by the developer, the user should be 
warned against making such interpretations. 

Comment: Users should be given sufficient 
guidance to enable them to judge the degree 
of confidence warranted for any use or inter- 
pretation recommended by the test developer. 
Test manuals and score reports should dis- 
courage overinterpretation of information 
that may be subject to considerable error. 
This is especially important if interpretation 
of performance on isolated items, small subsets 
of items, or subtest scores is suggested. 

Standard 1.11 

If the rationale for a test use or interpreta- 
tion depends on premises about the relation- 
ships among parts of the test, evidence 
concerning the internal structure of the test 
should be provided. 

Comment: It might be claimed, for example, 
that a test is essentially unidimensional. 
Such a claim could be supported by a mul- 
tivariate statistical analysis, such as a factor 
analysis, showing that the score variability 
attributable to one major dimension was 
much greater than the score variability 
attributable to any other identified dimen- 
sion. When a test provides more than one 
score, the interrelationships of those scores 
should be shown to be consistent with the 
construct(s) being assessed. 

Standard 1.12 

When interpretation of subscores, score dif- 
ferences, or profiles is suggested, the ration- 
ale and relevant evidence in support of such 
interpretation should be provided. Where 
composite scores are developed, the basis 
and rationale for arriving at the composites 
should be given. 

Comment: When a test provides more than 
one score, the distinctiveness of the separate 
scores should be demonstrated, and the inter- 
relationships of those scores should be shown 
to be consistent with the construct(s) being 
assessed. Moreover, evidence for the validity 
of interpretations of two separate scores would 
not necessarily justify an interpretation of the 
difference between them. Rather, the rationale 
and supporting evidence must pertain directly 
to the specific score or score combination to 
be interpreted or used. 


Standard 1.13 

When validity evidence includes statistical 
analyses of test results, either alone or 
together with data on other variables, the 
conditions under which the data were col- 
lected should be described in enough detail 
that users can judge the relevance of the 
statistical findings to local conditions. 
Attention should be drawn to any features 
of a validation data collection that are likely 
to differ from typical operational testing 
conditions and that could plausibly influ- 
ence test performance. 

Comment: Such conditions might include 
(but would not be limited to) the following: 
examinee motivation or prior preparation, the 
distribution of test scores over examinees, the 
time allowed for examinees to respond or 
other administrative conditions, examiner 
training or other examiner characteristics, 
the time intervals separating collection of 
data on different measures, or conditions 
that may have changed since the validity 
evidence was obtained. 

Standard 1.14 

When validity evidence includes empirical 
analyses of test responses together with data 
on other variables, the rationale for selecting 
the additional variables should be provided. 
Where appropriate and feasible, evidence 
concerning the constructs represented by 
other variables, as well as their technical 
properties, should be presented or cited. 
Attention should be drawn to any likely 
sources of dependence (or lack of independ- 
ence) among variables other than dependen- 
cies among the construct(s) they represent. 

Comment: The patterns of association 
between and among scores on the instrument 
under study and other variables should be 
consistent with theoretical expectations. The 
additional variables might be demographic 
characteristics, indicators of treatment conditions, 
or scores on other measures. They 
might include intended measures of the same 
construct or of different constructs. The relia- 
bility of scores from such other measures and 
the validity of intended interpretations of 
scores from these measures are an important 
part of the validity evidence for the instru- 
ment under study. If such variables include 
composite scores, the construction of the 
composites should be explained. In addition 
to considering the properties of each variable 
in isolation, it is important to guard against 
faulty interpretations arising from spurious 
sources of dependency among measures, 
including correlated errors or shared variance 
due to common methods of measurement or 
common elements. 

Standard 1.15 

When it is asserted that a certain level of 
test performance predicts adequate or 
inadequate criterion performance, informa- 
tion about the levels of criterion perform- 
ance associated with given levels of test 
scores should be provided. 

Comment: Regression equations are more use- 
ful than correlation coefficients, which are 
generally insufficient to fully describe patterns 
of association between tests and other vari- 
ables. Means, standard deviations, and other 
statistical summaries are needed, as well as 
information about the distribution of criteri- 
on performances conditional upon a given 
test score. Evidence of overall association 
between variables should be supplemented by 
information about the form of that associa- 
tion and about the variability associated with 
that association in different ranges of test 
scores. Note that data collections employing 
examinees selected for their extreme scores on 
one or more measures (extreme groups) typi- 
cally cannot provide adequate information 
about the association. 
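
To illustrate why a regression equation conveys more than a single correlation coefficient, the following minimal sketch fits a least-squares line predicting criterion performance from test scores and then compares the spread of prediction errors in lower and higher score ranges. The function names, the cut point, and the example data are hypothetical and serve only to illustrate the idea.

```python
from math import sqrt
from statistics import mean


def fit_line(test_scores, criterion):
    """Least-squares slope and intercept for predicting criterion from test."""
    mx, my = mean(test_scores), mean(criterion)
    sxx = sum((x - mx) ** 2 for x in test_scores)
    sxy = sum((x - mx) * (y - my) for x, y in zip(test_scores, criterion))
    slope = sxy / sxx
    return slope, my - slope * mx


def residual_spread(test_scores, criterion, slope, intercept, cut):
    """Root-mean-square prediction error below the cut and at or above it."""
    def rms(residuals):
        if not residuals:
            return float("nan")
        return sqrt(mean([e * e for e in residuals]))

    pairs = list(zip(test_scores, criterion))
    low = [y - (intercept + slope * x) for x, y in pairs if x < cut]
    high = [y - (intercept + slope * x) for x, y in pairs if x >= cut]
    return rms(low), rms(high)


# Illustrative (made-up) data: test scores and criterion ratings.
scores = [10, 12, 15, 18, 20, 24]
ratings = [2.0, 2.5, 2.4, 3.6, 3.1, 4.4]
slope, intercept = fit_line(scores, ratings)
low_rms, high_rms = residual_spread(scores, ratings, slope, intercept, cut=16)
```

Markedly different error spreads in the two ranges would indicate that the predicted criterion level is more dependable for some test scores than for others, which a single correlation would not reveal.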


Standard 1.16 

When validation relies on evidence that test 
scores are related to one or more criterion 
variables, information about the suitability 
and technical quality of the criteria should 
be reported. 

Comment: The description of each criterion 
variable should include evidence concerning 
its reliability, the extent to which it represents 
the intended construct (e.g., job performance), 
and the extent to which it is likely to be 
influenced by extraneous sources of variance. 
Special attention should be given to sources 
that previous research suggests may introduce 
extraneous variance that might bias the crite- 
rion for or against identifiable groups. 

Standard 1.17 

If test scores are used in conjunction with 
other quantifiable variables to predict some 
outcome or criterion, regression (or equiva- 
lent) analyses should include those additional 
relevant variables along with the test scores. 

Comment: In general, if several predictors of 
some criterion are available, the optimum 
combination of predictors cannot be deter- 
mined solely from separate, pairwise examina- 
tions of the criterion variable with each 
separate predictor in turn. It is often informa- 
tive to estimate the increment in predictive 
accuracy that may be expected when each 
variable, including the test score, is intro- 
duced in addition to all other available vari- 
ables. Analyses involving multiple predictors 
should be verified by cross-validation or 
equivalent analysis whenever feasible, and the 
precision of estimated regression coefficients 
should be reported. 
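
The increment in predictive accuracy mentioned above can be sketched for the simplest case of two predictors. Assuming standardized variables and known correlations, the squared multiple correlation follows from the three pairwise correlations; the function names and the example values are hypothetical. The sketch does not address cross-validation or the precision of the estimates, which the comment also calls for.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """Squared multiple correlation of a criterion with two predictors.

    r_y1, r_y2: correlations of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    if abs(r_12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


def increment_from_test(r_y_test, r_y_other, r_test_other):
    """Gain in R-squared when the test is added to the other predictor."""
    combined = r_squared_two_predictors(r_y_test, r_y_other, r_test_other)
    return combined - r_y_other ** 2


# Illustrative (made-up) correlations: test with criterion, prior grades
# with criterion, and test with prior grades.
gain = increment_from_test(0.5, 0.4, 0.3)
```

The gain depends on the correlation between the predictors as well as on each predictor's correlation with the criterion, which is why separate pairwise examinations cannot determine the optimum combination.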

Standard 1.18 

When statistical adjustments, such as those 
for restriction of range or attenuation, are 
made, both adjusted and unadjusted coefficients, 
as well as the specific procedure used, 
and all statistics used in the adjustment, 
should be reported. 

Comment: The correlation between two variables, 
such as test scores and criterion measures, 
depends on the range of values on each 
variable. For example, the test scores and the 
criterion values of selected applicants will typi- 
cally have a smaller range than the scores of 
all applicants. Statistical methods are available 
for adjusting the correlation to reflect the 
population of interest rather than the sample 
available. Such adjustments are often appro- 
priate, as when comparing results across 
various situations. Reporting an adjusted 
correlation should be accompanied by a state- 
ment of the method and the statistics used in 
making the adjustment. 
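
As a sketch of what such a report might contain, the code below applies two widely used adjustments: the correction for direct range restriction on the test (often called Thorndike's Case II) and the correction for unreliability in the criterion. These are offered as common examples rather than as the procedures any particular study used; the function names, the report layout, and the example values are hypothetical.

```python
from math import sqrt


def correct_range_restriction(r, sd_unrestricted, sd_restricted):
    """Adjust r for direct selection on the test (Thorndike Case II)."""
    if sd_unrestricted <= 0 or sd_restricted <= 0:
        raise ValueError("standard deviations must be positive")
    u = sd_unrestricted / sd_restricted
    return r * u / sqrt(1 + r * r * (u * u - 1))


def correct_attenuation(r, criterion_reliability):
    """Adjust r for measurement error in the criterion only."""
    if not 0 < criterion_reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    return r / sqrt(criterion_reliability)


def adjustment_report(r, sd_unrestricted, sd_restricted):
    """Bundle both coefficients with the method and statistics used."""
    return {
        "unadjusted_r": r,
        "adjusted_r": correct_range_restriction(r, sd_unrestricted, sd_restricted),
        "method": "direct range restriction on the test (Thorndike Case II)",
        "sd_unrestricted": sd_unrestricted,
        "sd_restricted": sd_restricted,
    }


# Illustrative (made-up) values: applicant SD of 10, selected-group SD of 5.
report = adjustment_report(0.30, 10.0, 5.0)
```

Keeping the unadjusted coefficient and the input statistics alongside the adjusted value lets a reader reproduce the adjustment or substitute different assumptions.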

Standard 1.19 

If a test is recommended for use in assigning 
persons to alternative treatments or is likely 
to be so used, and if outcomes from those 
treatments can reasonably be compared on a 
common criterion, then, whenever feasible, 
supporting evidence of differential outcomes 
should be provided. 

Comment: If a test is used for classification 
into alternative occupational, therapeutic, or 
educational programs, it is not sufficient just 
to show that the test predicts treatment out- 
comes. Support for the validity of the classifi- 
cation procedure is provided by showing that 
the test is useful in determining which per- 
sons are likely to profit differentially from 
one treatment or another. Treatment cate- 
gories may have to be combined to assemble 
sufficient cases for statistical analysis. It is rec- 
ognized, however, that such research may not 
be feasible, because ethical and legal con- 
straints on differential assignments may for- 
bid control groups. 


Standard 1.20 

When a meta-analysis is used as evidence of 
the strength of a test-criterion relationship, 
the test and the criterion variables in the 
local situation should be comparable with 
those in the studies summarized. If relevant 
research includes credible evidence that any 
other features of the testing application may 
influence the strength of the test-criterion 
relationship, the correspondence between 
those features in the local situation and in 
the meta-analysis should be reported. Any 
significant disparities that might limit the 
applicability of the meta-analytic findings to 
the local situation should be noted explicitly. 

Comment: The meta-analysis should incorpo- 
rate all available studies meeting explicitly 
stated inclusion criteria. Meta-analytic evi- 
dence used in test validation typically is based 
on a number of tests measuring the same or 
very similar constructs and criterion measures 
that likewise measure the same or similar 
constructs. A meta-analytic study may also be 
limited to a single test and a single criterion. 
For each study included in the analysis, the 
test-criterion relationship is expressed in some 
common metric, often as an effect size. The 
strength of the test-criterion relationship may 
be moderated by features of the situation in 
which the test and criterion measures were 
obtained (e.g., types of jobs, characteristics of 
test takers, time interval separating collection 
of test and criterion measures, year or decade 
in which the data were collected). If test-cri- 
terion relationships vary according to such 
moderator variables, then, the numbers of 
studies permitting, the meta-analysis should 
report separate estimated effect size distribu- 
tions conditional upon relevant situational 
features. This might be accomplished, for 
example, by reporting separate distributions 
for subsets of studies or by estimating the 
magnitudes of the influences of situational 
features on effect sizes. 
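
The first of these options, reporting results separately for subsets of studies, can be sketched as follows. The example groups studies by a single situational feature and reports a sample-size-weighted mean correlation, the number of studies, and the total sample size for each subset. The function name, the input layout, and the example studies are hypothetical.

```python
from collections import defaultdict


def summarize_by_moderator(studies):
    """Summarize test-criterion correlations within moderator levels.

    studies: iterable of (moderator_level, sample_size, observed_r) tuples.
    Returns {level: {"mean_r", "k", "total_n"}}.
    """
    grouped = defaultdict(list)
    for level, n, r in studies:
        grouped[level].append((n, r))
    summary = {}
    for level, rows in grouped.items():
        total_n = sum(n for n, _ in rows)
        summary[level] = {
            # Sample-size-weighted mean correlation within the subset.
            "mean_r": sum(n * r for n, r in rows) / total_n,
            "k": len(rows),        # number of studies in the subset
            "total_n": total_n,    # pooled sample size in the subset
        }
    return summary


# Illustrative (made-up) studies, grouped by type of job.
by_job = summarize_by_moderator([
    ("clerical", 100, 0.20),
    ("clerical", 300, 0.40),
    ("managerial", 200, 0.10),
])
```

Reporting the number of studies in each subset makes it apparent when a conditional estimate rests on too few studies to be dependable.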

Standard 1.21 

Any meta-analytic evidence used to support 
an intended test use should be clearly 
described, including methodological choices 
in identifying and coding studies, correcting 
for artifacts, and examining potential mod- 
erator variables. Assumptions made in cor- 
recting for artifacts such as criterion 
unreliability and range restriction should be 
presented, and the consequences of these 
assumptions made clear. 

Comment: Meta-analysis inevitably involves 
judgments regarding a number of method- 
ological choices. The bases for these judg- 
ments should be articulated. In the case of 
choices involving some degree of uncertainty, 
such as artifact corrections based on assumed 
values, the uncertainty should be acknowl- 
edged and the degree to which conclusions 
about validity hinge on these assumptions 
should be examined and reported. 

Standard 1.22 

When it is clearly stated or implied that a 
recommended test use will result in a specif- 
ic outcome, the basis for expecting that out- 
come should be presented, together with 
relevant evidence. 

Comment: If it is asserted, for example, that 
using a given test for employee selection will 
result in reduced employee errors or training 
costs, evidence in support of that assertion 
should be provided. A given claim for the 
benefits of test use may be supported by logi- 
cal or theoretical argument as well as empiri- 
cal data. Due weight should be given to 
findings in the scientific literature that may 
be inconsistent with the stated expectation. 

Standard 1.23 

When a test use or score interpretation is 
recommended on the grounds that testing or 
the testing program per se will result in 
some indirect benefit in addition to the util- 
ity of information from the test scores them- 
selves, the rationale for anticipating the 
indirect benefit should be made explicit. 
Logical or theoretical arguments and empiri- 
cal evidence for the indirect benefit should 
be provided. Due weight should be given to 
any contradictory findings in the scientific 
literature, including findings suggesting 
important indirect outcomes other than 
those predicted. 

Comment: For example, certain educational 
testing programs have been advocated on 
the grounds that they would have a salutary 
influence on classroom instructional practices 
or would clarify students’ understanding of 
the kind or level of achievement they were 
expected to attain. To the extent that such 
claims enter into the justification for a testing 
program, they become part of the validity 
argument for test use and so should be exam- 
ined as part of the validation effort. Due 
weight should be given to evidence against 
such predictions, for example, evidence that 
under some conditions educational testing 
may have a negative effect on classroom 
instruction. 

Standard 1.24 

When unintended consequences result from 
test use, an attempt should be made to 
investigate whether such consequences arise 
from the test’s sensitivity to characteristics 
other than those it is intended to assess or 
to the test’s failure fully to represent the 
intended construct. 

Comment: The validity of test score interpre- 
tations may be limited by construct-irrelevant 
components or construct underrepresentation. 
When unintended consequences appear to 
stem, at least in part, from the use of one or 
more tests, it is especially important to check 
that these consequences do not arise from 
such sources of invalidity. Although group 
differences, in and of themselves, do not call 
into question the validity of a proposed inter- 
pretation, they may increase the salience of 
plausible rival hypotheses that should be 
investigated as part of the validation effort. 

2. RELIABILITY AND ERRORS OF 
MEASUREMENT 


Background 

A test, broadly defined, is a set of tasks designed 
to elicit or a scale to describe examinee behavior 
in a specified domain, or a system for collecting 
samples of an individual’s work in a particular 
area. Coupled with the device is a scoring pro- 
cedure that enables the examiner to quantify, 
evaluate, and interpret the behavior or work 
samples. Reliability refers to the consistency 
of such measurements when the testing pro- 
cedure is repeated on a population of individ- 
uals or groups. 

The discussion that follows introduces 
concepts and procedures that may not be famil- 
iar to some readers. It is not expected that the 
brief definitions and explanations presented 
here will be sufficient to enable the less sophis- 
ticated reader to become adequately conver- 
sant with these developments. To achieve a 
better understanding, such readers may need 
to consult more comprehensive treatments 
in the measurement literature. 

The usefulness of behavioral measure- 
ments presupposes that individuals and groups 
exhibit some degree of stability in their behav- 
ior. However, successive samples of behavior 
from the same person are rarely identical in all 
pertinent respects. An individual’s perform- 
ances, products, and responses to sets of test 
questions vary in their quality or character 
from one occasion to another, even under 
strictly controlled conditions. This variation 
is reflected in the examinee’s scores. The causes
of this variability are generally unrelated to
the purposes of measurement. An examinee 
may try harder, may make luckier guesses, be 
more alert, feel less anxious, or enjoy better 
health on one occasion than another. An 
examinee may have knowledge, experience, or 
understanding that is more relevant to some 
tasks than to others in the domain sampled 
by the test. Some individuals may exhibit less
variation in their scores than others, but no
examinee is completely consistent. Because of 
this variation and, in some instances, because 
of subjectivity in the scoring process, an indi- 
vidual’s obtained score and the average score 
of a group will always reflect at least a small 
amount of measurement error. 

To say that a score includes a component 
of error implies that there is a hypothetical 
error-free value that characterizes an examinee 
at the time of testing. In classical test theory 
this error-free value is referred to as the per- 
son’s true score for the test or measurement 
procedure. It is conceptualized as the hypo- 
thetical average score resulting from many 
repetitions of the test or alternate forms of 
the instrument. In statistical terms, the true 
score is a personal parameter and each observed 
score of an examinee is presumed to estimate 
this parameter. Under an approach to reliability 
estimation known as generalizability theory, a 
comparable concept is referred to as an exami- 
nee’s universe score. Under item response theory 
(IRT), a closely related concept is called an 
examinee’s ability or trait parameter, though 
observed scores and trait parameters may be 
stated in different units. The hypothetical dif- 
ference between an examinee’s observed score 
on any particular measurement and the exam- 
inee’s true or universe score for the procedure 
is called measurement error. 
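
In classical test theory notation, the decomposition described above can be written compactly. The symbols used here (X for an observed score, T for the true score, E for measurement error, p indexing examinees, and j indexing hypothetical repetitions of the procedure) are conventional choices introduced for illustration and do not appear in the text.

```latex
X_{pj} = T_p + E_{pj}, \qquad
T_p = \mathbb{E}_j\!\left[ X_{pj} \right], \qquad
E_{pj} = X_{pj} - T_p
```

Under this sketch the true score is the expected value over repetitions, so the errors for any one examinee average to zero by construction.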

The definition of what constitutes a 
standardized test or measurement procedure
has broadened significantly in recent years. At 
one time the cardinal features of most stan- 
dardized tests were consistency of the test 
materials from examinee to examinee, close 
adherence to stipulated procedures for test 
administration, and use of prescribed scoring 
rules that could be applied with a high degree 
of consistency. These features were, in fact, 
what made a test “standardized,” and they 
made meaningful norms possible. In employment
settings and certification programs, flexible
measurement procedures have been in
use for many years. Individualized oral exami- 
nations, simulations, analyses of extended 
case reports, and performance in real-life set- 
tings such as clinics are now commonplace. 

In education, however, large-scale testing pro- 
grams with a high degree of flexibility in test 
format and administrative procedures are a 
relatively recent development. In some pro- 
grams cumulative portfolios of student work 
have been substituted for more traditional 
end-of-year tests of achievement. Other pro- 
grams now allow examinees to choose their 
own topics to demonstrate their abilities. Still 
others permit or encourage small groups of 
examinees to work cooperatively in complet- 
ing the test. A science examination, for exam- 
ple, might involve a team of high school 
students who conduct a study of the sources 
of pollution in local streams and prepare a 
report on their findings. Examinations of 
this kind raise complex issues regarding the 
domain represented by the test and about 
the generalizability of individual and group 
scores. Each step toward greater flexibility 
almost inevitably enlarges the scope and mag- 
nitude of measurement error. However, it is 
possible that some of the resultant sacrifices 
in reliability may reduce construct irrelevance 
or construct underrepresentation in an assess- 
ment program. 

Characteristics and Implications of 
Measurement Error 

Errors of measurement are generally viewed as 
random and unpredictable. They are concep- 
tually distinguished from systematic errors, 
which may also affect performance of individ- 
uals or groups, but in a consistent rather than 
a random manner. For example, a systematic 
group error would occur as a result of differ- 
ences in the difficulty of test forms that have 
not been adequately equated. When one test 
form is less difficult than another, examinees
who take the easier form may be expected to
earn a higher average score than those who take 
the more difficult form. Such a difference 
would not be considered an error of measure- 
ment under most methods of quantifying and 
summarizing error, though generalizability 
theory would permit test form differences to 
be recognized as an error source. 

The systematic factors that may differen- 
tially affect the performance of individual test 
takers are not as easily detected or overridden 
as those affecting groups. For example, some 
examinees experience levels of test anxiety 
that severely impair cognitive efficiency. The 
presence of such a condition can sometimes 
be recognized in an examinee, but the effect 
cannot be overcome by statistical adjustments. 
The individual systematic errors are not gen- 
erally regarded as an element that contributes 
to unreliability. Rather, they constitute a 
source of construct-irrelevant variance and 
thus may detract from validity.

Important sources of measurement error 
may be broadly categorized as those rooted 
within the examinees and those external to 
them. Fluctuations in the level of an exam- 
inee’s motivation, interest, or attention and 
the inconsistent application of skills are clear- 
ly internal factors that may lead to score 
inconsistencies. Differences among testing 
sites in their freedom from distractions, the 
random effects of scorer subjectivity, and vari- 
ation in scorer standards are examples of 
external factors. The potency and importance 
of any particular source depend on the specif- 
ic conditions under which the measures are 
taken, how performances are scored, and the 
interpretations made from the scores. A partic- 
ular factor, such as the subjectivity in scoring, 
may be a significant source of measurement 
error in some assessments and a minor con- 
sideration in others. 

Some changes in scores from one occa- 
sion to another, it should be noted, are not 
regarded as error, because they result, in part, 
from an intervention, learning, or maturation 
that has occurred between the initial and final
measures. The difference within an individual 
indicates, to some extent, the effects of the 
intervention or the extent of growth. In such 
settings, change per se constitutes the phe- 
nomenon of interest. The difference or the 
change score then becomes the measure to 
which reliability pertains. 

Measurement error reduces the useful- 
ness of measures. It limits the extent to which 
test results can be generalized beyond the par- 
ticulars of a specific application of the meas- 
urement process. Therefore, it reduces the 
confidence that can be placed in any single 
measurement. Because random measurement 
errors are inconsistent and unpredictable, 
they cannot be removed from observed 
scores. However, their aggregate magnitude 
can be summarized in several ways, as dis- 
cussed below. 

Summarizing Reliability Data 

Information about measurement error is 
essential to the proper evaluation and use of 
an instrument. This is true whether the meas- 
ure is based on the responses to a specific set 
of questions, a portfolio of work samples, the 
performance of a task, or the creation of an 
original product. The ideal approach to the 
study of reliability entails independent repli- 
cation of the entire measurement process. 
However, only a rough or partial approxima- 
tion of such replication is possible in many 
testing situations, and investigation of measurement
error may require special studies that depart
from routine testing procedures. Nevertheless, 
it should be the goal of test developers to 
investigate test reliability as fully as practical 
considerations permit. No test developer is 
exempt from this responsibility. 

The critical information on reliability 
includes the identification of the major 
sources of error, summary statistics bearing 
on the size of such errors, and the degree of 
generalizability of scores across alternate
forms, scorers, administrations, or other relevant
dimensions. It also includes a description
of the examinee population to whom the 
foregoing data apply, as the data may accu- 
rately reflect what is true of one population 
but misrepresent what is true of another. For 
example, a given reliability coefficient or esti- 
mated standard error derived from scores of a 
nationally representative sample may differ 
significantly from that obtained for a more
homogeneous sample drawn from one gen- 
der, one ethnic group, or one community. 

Reliability information may be reported 
in terms of variances or standard deviations of 
measurement errors, in terms of one or more 
coefficients, or in terms of IRT-based test 
information functions. The standard error of 
measurement is the standard deviation of a 
hypothetical distribution of measurement 
errors that arises when a given population is 
assessed via a particular test or procedure. 
The overall variance of measurement errors is 
actually a weighted average of the values that 
hold at various true score levels. The variance 
at a particular level is called a conditional 
error variance and its square root a conditional 
standard error. Traditionally, three broad cate- 
gories of reliability coefficients have been rec- 
ognized: (a) coefficients derived from the 
administration of parallel forms in independent 
testing sessions (alternate-form coefficients); 
(b) coefficients obtained by administration 
of the same instrument on separate occa- 
sions (test-retest or stability coefficients); 
and (c) coefficients based on the relation- 
ships among scores derived from individual 
items or subsets of the items within a test, 
all data accruing from a single administra- 
tion (internal consistency coefficients). 
Where test scoring involves a high level of 
judgment, indexes of scorer consistency are 
commonly obtained. With the development 
of generalizability theory, the foregoing 
three categories may now be seen as special 
cases of a more general classification: generalizability
coefficients.

Like traditional reliability coefficients, a
generalizability coefficient is defined as the ratio
of true or universe score variance to observed 
score variance. Unlike traditional approaches 
to the study of reliability, however, generaliz- 
ability theory permits the researcher to specify 
and estimate the various components of true 
score variance, error variance, and observed 
score variance. Estimation is typically accom- 
plished by the application of the techniques 
of analysis of variance. Of special interest are 
the separate numerical estimates of the com- 
ponents of overall error variance. Such esti- 
mates permit examination of the contribution 
of each source of error to the overall measure- 
ment process. The generalizability approach 
also makes possible the estimation of coeffi- 
cients that apply to a wide variety of potential 
measurement designs. 
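
The variance-component estimation described above can be sketched for the simplest case. The example assumes a fully crossed persons-by-raters design with one score per cell and no missing data; the function names and the choice of raters as the only facet are illustrative assumptions, not part of the text.

```python
def g_study(scores):
    """Estimate variance components from scores[p][r] (person p, rater r).

    Assumes a fully crossed design with one observation per cell.
    Uses the standard expected-mean-square equations from two-way ANOVA.
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(scores[p][r] for p in range(n_p)) / n_p
                   for r in range(n_r)]

    ms_person = n_r * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
    ms_rater = n_p * sum((m - grand) ** 2 for m in rater_means) / (n_r - 1)
    ss_resid = sum(
        (scores[p][r] - person_means[p] - rater_means[r] + grand) ** 2
        for p in range(n_p) for r in range(n_r)
    )
    ms_resid = ss_resid / ((n_p - 1) * (n_r - 1))

    # Negative estimates are truncated at zero, a common convention.
    return {
        "person": max((ms_person - ms_resid) / n_r, 0.0),   # universe score variance
        "rater": max((ms_rater - ms_resid) / n_p, 0.0),     # rater severity differences
        "residual": ms_resid,                               # interaction plus random error
    }


def generalizability_coefficient(components, n_raters):
    """Coefficient for relative interpretations with n_raters per examinee."""
    relative_error = components["residual"] / n_raters
    return components["person"] / (components["person"] + relative_error)


def dependability_coefficient(components, n_raters):
    """Coefficient for absolute interpretations; rater severity counts as error."""
    absolute_error = (components["rater"] + components["residual"]) / n_raters
    return components["person"] / (components["person"] + absolute_error)
```

Because the components are estimated separately, the same study can be used to project coefficients for designs with a different number of raters simply by changing `n_raters`.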

The test information function, an impor- 
tant result of IRT, efficiently summarizes how 
well the test discriminates among individuals 
at various levels of the ability or trait being
assessed. Under the IRT conceptualization, a 
mathematical function called the item charac- 
teristic curve or item response function is used 
as a model to represent the increasing propor- 
tion of correct responses to an item for groups 
at progressively higher levels of the ability or 
trait being measured. Given an adequate 
database, the parameters of the characteristic 
curve of each item in a test can be estimated. 
The test information function can then be 
approximated. This function may be viewed 
as a mathematical statement of the precision 
of measurement at each level of the given 
trait. Precision, in the IRT context, is analo- 
gous to the reciprocal of the conditional error 
variance of classical test theory. 

Interpretation of Reliability Data 

In general, reliability coefficients are most useful 
in comparing tests or measurement procedures, 
particularly those that yield scores in different 
units or metrics. However, such comparisons
are rarely straightforward. Allowance must be
made for differences in the variability of the 
groups on which the coefficients are based, 
the techniques used to obtain the coefficients, 
the sources of error reflected in the coeffi- 
cients, and the lengths of the instruments 
being compared in terms of testing time. 

Generalizability coefficients and the 
many coefficients included under the tradi- 
tional categories may appear to be inter- 
changeable, but some convey quite different 
information from others. A coefficient in any 
given category may encompass errors of 
measurement from a highly restricted per- 
spective, a very broad perspective, or some 
point between these extremes. For example, 
a coefficient may reflect error due to scorer 
inconsistencies but not reflect the variation 
that characterizes a succession of examinee 
performances or products. A coefficient may 
reflect only the internal consistency of item 
responses within an instrument and fail to 
reflect measurement error associated with 
day-to-day changes in examinee health, effi- 
ciency, or motivation. 

It should not be inferred, however, that 
alternate-form or test-retest coefficients based 
on test administrations several days or weeks 
apart are always preferable to internal consis- 
tency coefficients. For many tests, internal 
consistency coefficients do not differ signifi- 
cantly from alternate-form coefficients. Where 
only one form of a test exists, retesting may 
result in an inflated correlation between the 
first and second scores due to idiosyncratic 
features of the test or to examinee recall of 
initial responses. Also, an individual’s status 
on some attributes, such as mood or emo- 
tional state, may change significantly in a 
short period of time. In the assessment of 
such constructs the multiple measures that 
give rise to reliability estimates should be 
obtained within the short period in which the 
attribute remains stable. Therefore, for char- 
acteristics of this kind an internal consistency 
coefficient may be preferred.

The standard error of measurement is
generally more relevant than the reliability 
coefficient once a measurement procedure has 
been adopted and interpretation of scores has 
become the user’s primary concern. It should 
be noted that standard errors share some of 
the ambiguities which characterize reliability 
coefficients, and estimates may vary in their 
quality. Information about the precision of 
measurement at each of several widely spaced 
score levels — that is, conditional standard 
errors — is usually a valuable supplement to the 
single statistic for all score levels combined. 
Like reliability and generalizability coeffi- 
cients, standard errors may reflect variation 
from many sources of error or only a few. 
For most purposes, a more comprehensive 
standard error is more informative than a 
less comprehensive value. However, there 
are many exceptions to this generalization. 
Practical constraints often preclude conduct 
of the kinds of studies that would yield esti- 
mates of the preferred standard errors. 
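
The relation between a reliability coefficient and an overall standard error of measurement can be illustrated with the familiar classical formula, in which the standard error equals the observed score standard deviation times the square root of one minus the reliability. The sketch below assumes that formula, normally distributed errors, and a single standard error for all score levels; the function names and the default multiplier of 1.96 are illustrative assumptions.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical overall standard error: sd * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def score_band(observed, sd, reliability, z=1.96):
    """Symmetric band around an observed score (about 95% when z = 1.96)."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem
```

For a scale with a standard deviation of 15 and a reliability of .91, the standard error is about 4.5 points, so an observed score of 100 carries a band of roughly 91 to 109. A band built from conditional standard errors could differ at different score levels.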

Measurements derived from observations 
of behavior or evaluations of products are espe- 
cially sensitive to a variety of error factors. These 
include evaluator biases and idiosyncrasies, 
scoring subjectivity, and intra-examinee factors 
that cause variation from one performance or 
product to another. The methods of general- 
izability theory are well suited to the investi- 
gation of the reliability of the scores on such 
measures. Estimates of the error variance 
associated with each specific source and with 
the interactions between sources indicate the 
extent to which examinee scores may be gen- 
eralized to a population of scorers and to a 
universe of products or performances. 

The interpretations of test scores may be 
broadly categorized as relative or absolute. 
Relative interpretations convey the standing 
of an individual or group within a reference 
population. Absolute interpretations relate the 
status of an individual or group to defined 
standards. These standards may originate in 
empirical data for one or more populations or
be based entirely on authoritative judgment.
Different values of the standard error apply 
to the two types of interpretations. 

The test information function can be 
perceived as an alternative to traditional indices
of measurement precision, but there are 
important distinctions that should be noted. 
Standard errors under classical test theory can 
be derived by several different approaches. 
These yield similar, but not identical, results. 
More significantly, standard errors, like relia- 
bility coefficients, may reflect a broad con- 
figuration of error factors or a restricted 
configuration, depending on the design of the 
reliability study. Test information functions, 
on the other hand, are limited to the restrict- 
ed definition of measurement error that is 
associated with internal consistency reliabili- 
ties. In addition, under IRT several different 
mathematical models have been proposed and 
accepted as the basic form of the item charac- 
teristic curve. Adoption of one model rather 
than another can have a material effect on the 
derived test information function. 

A final consideration has significant impli- 
cations for both IRT and classical approaches 
to quantification of test score precision. It is 
this: Indices of precision depend on the scale 
in which they are reported. An index stated 
in terms of raw scores or the trait level esti- 
mates of IRT may convey a radically different 
perception of reliability than the same index 
restated in terms of derived scores. This same 
contrast may hold for conditional standard 
errors. In terms of the basic score scale, preci- 
sion may appear to be high at one score level, 
low at another. But when the conditional 
standard errors are restated in units of derived 
scores, such as grade equivalents or standard 
scores, quite different trends in comparative 
precision may emerge. Therefore, measure- 
ment precision under both theories very 
strongly depends on the scale in which test 
scores are reported and interpreted. 

Precision and consistency in measure- 
ment are always desirable. However, the need 
for precision increases as the consequences of
decisions and interpretations grow in importance.
tance. If a decision can and will be corrobo- 
rated by information from other sources or if 
an erroneous initial decision can be quickly 
corrected, scores with modest reliability may 
suffice. But if a test score leads to a decision 
that is not easily reversed, such as rejection or 
admission of a candidate to a professional 
school or the decision by a jury that a serious 
injury was sustained, the need for a high degree 
of precision is much greater. 

Where the purpose of measurement is 
classification, some measurement errors are 
more serious than others. An individual who 
is far above or far below the value established 
for pass/fail or for eligibility for a special pro- 
gram can be mismeasured without serious 
consequences. Mismeasurement of examinees 
whose true scores are close to the cut score is 
a more serious concern. The techniques used 
to quantify reliability should recognize these 
circumstances. This can be done by reporting 
the conditional standard error in the vicinity 
of the critical value. 

Some authorities have proposed that a 
semantic distinction be made between “reliability
of scores” and “degree of agreement in
classification.” The former term would be 
reserved for analysis of score variation under 
repeated measurement. The term classification 
consistency or inter-rater agreement, rather than
reliability, would be used in discussions of 
consistency of classification. Adoption of such 
usage would make it clear that the impor- 
tance of an error of any given size depends on 
the proximity of the examinee’s score to the 
cut score. However, it should be recognized 
that the degree of consistency or agreement in 
examinee classification is specific to the cut 
score employed and its location within the 
score distribution. 
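
The dependence of classification consistency on the cut score can be shown with a small sketch. It assumes paired scores for the same examinees from two administrations or two forms, treats a score at or above the cut as passing, and reports both the raw proportion of consistent classifications and Cohen's kappa as one chance-corrected index; the function name and data are hypothetical.

```python
def classification_consistency(scores_a, scores_b, cut):
    """Return (proportion of consistent pass/fail decisions, Cohen's kappa)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(scores_a)
    pass_a = [s >= cut for s in scores_a]
    pass_b = [s >= cut for s in scores_b]

    agreement = sum(a == b for a, b in zip(pass_a, pass_b)) / n
    rate_a = sum(pass_a) / n
    rate_b = sum(pass_b) / n
    # Agreement expected if the two classifications were independent.
    chance = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    kappa = (agreement - chance) / (1 - chance) if chance < 1 else float("nan")
    return agreement, kappa
```

Applying the function to the same four pairs of scores with cut scores of 60 and 65 gives different results, because one examinee falls on opposite sides of the lower cut on the two occasions but on the same side of the higher one.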

Average scores of groups, when interpret- 
ed as measures of program effectiveness, 
involve error factors that are not identical to 
those that operate at the individual level. For
large groups, the positive and negative measurement
errors of individuals may average out
almost completely in group means. However, 
the sampling errors associated with the ran- 
dom sampling of persons who are tested for 
purposes of program evaluation are still pres- 
ent. This component of the variation in the 
mean achievement of school classes from year 
to year or in the average expressed satisfaction 
of successive samples of the clients of a pro- 
gram may constitute a potent source of error 
in program evaluations. It can be a significant 
source of error in inferences about programs 
even if there is a high degree of precision in 
individual test scores. Therefore, when an 
instrument is used to make group judgments, 
reliability data must bear directly on the 
interpretations specific to groups. Standard 
errors appropriate to individual scores are not 
appropriate measures of the precision of group 
averages. A more appropriate statistic is the 
standard error of the observed score means. 
Generalizability theory can provide more 
refined indices when the sources of measure- 
ment error are numerous and complex. 
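
As a rough illustration, assume simple random sampling of n persons and the classical decomposition of observed score variance into true score variance and error variance. The symbols below are introduced here for illustration and are not taken from the text. The standard error for an individual score and the standard error of an observed group mean can then be contrasted as follows.

```latex
\mathrm{SE}_{\text{individual}} = \sigma_E, \qquad
\mathrm{SE}_{\bar{X}} = \sqrt{\frac{\sigma_T^2 + \sigma_E^2}{n}}
```

The second expression retains the variance among persons, which is why sampling error can remain sizable for small groups even when individual scores are measured precisely.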

Typically, developers and distributors of 
tests have primary responsibility for obtain- 
ing and reporting evidence of reliability or 
test information functions. The user must 
have such data to make an informed choice 
among alternative measurement approaches 
and will generally be unable to conduct relia- 
bility studies prior to operational use of an 
instrument. In some instances, however, local 
users of a test or procedure must accept at 
least partial responsibility for documenting 
the precision of measurement. This obliga- 
tion holds when one of the primary purposes 
of measurement is to rank or classify exam- 
inees within the local population. It also 
holds when users must rely on local scorers 
who are trained to use the scoring rubrics 
provided by the test developer. In such set- 
tings, local factors may materially affect the 
magnitude of error variance and observed 
score variance. Therefore, the reliability of 
scores may differ appreciably from that reported
by the developer.

The reporting of reliability coefficients 
alone, with little detail regarding the methods 
used to estimate the coefficient, the nature of 
the group from which the data were derived, 
and the conditions under which the data were 
obtained constitutes inadequate documentation. 
General statements to the effect that a test is 
“reliable” or that it is “sufficiently reliable to 
permit interpretations of individual scores” are 
rarely, if ever, acceptable. It is the user who must 
take responsibility for determining whether or 
not scores are sufficiently trustworthy to justify 
anticipated uses and interpretations. Of course, 
test constructors and publishers are obligated 
to provide sufficient data to make informed 
judgments possible. 

As the foregoing comments emphasize, 
there is no single, preferred approach to 
quantification of reliability. No single index 
adequately conveys all of the relevant facts.
No one method of investigation is optimal in
all situations, nor is the test developer limited 
to a single approach for any instrument. The 
choice of estimation techniques and the mini- 
mum acceptable level for any index remain a 
matter of professional judgment. 

Although reliability is discussed here as an 
independent characteristic of test scores, it should 
be recognized that the level of reliability of scores 
has implications for the validity of score interpretations.
Reliability data ultimately bear on
the repeatability of the behavior elicited by the 
test and the consistency of the resultant scores. 
The data also bear on the consistency of classi- 
fications of individuals derived from the scores. 
To the extent that scores reflect random errors 
of measurement, their potential for accurate 
prediction of criteria, for beneficial examinee 
diagnosis, and for wise decision making is lim- 
ited. Relatively unreliable scores, in conjunction 
with other convergent information, may some- 
times be of value to a test user, but the level of 
a score’s reliability places limits on its unique 
contribution to validity for all purposes. 


Standard 2.1 

For each total score, subscore, or combina- 
tion of scores that is to be interpreted, esti- 
mates of relevant reliabilities and standard 
errors of measurement or test information 
functions should be reported. 

Comment: It is not sufficient to report esti- 
mates of reliabilities and standard errors of 
measurement only for total scores when sub- 
scores are also interpreted. The form-to-form 
and day-to-day consistency of total scores on 
a test may be acceptably high, yet subscores 
may have unacceptably low reliability. For all 
scores to be interpreted, users should be sup- 
plied with reliability data in enough detail to 
judge whether scores are precise enough for 
the users’ intended interpretations. Composites 
formed from selected subtests within a test 
battery are frequently proposed for predictive 
and diagnostic purposes. Users need informa- 
tion about the reliability of such composites. 

Standard 2.2 

The standard error of measurement, both 
overall and conditional (if relevant), should 
be reported both in raw score or original 
scale units and in units of each derived score 
recommended for use in test interpretation. 

Comment: The most common derived scores 
include standard scores, grade or age equiva- 
lents, and percentile ranks. Because raw scores 
on norm-referenced tests are only rarely inter- 
preted directly, standard errors in derived 
score units are more helpful to the typical test 
user. A confidence interval for an examinee’s
true score, universe score, or percentile rank 
serves much the same purpose as a standard 
error and can be used as an alternative approach 
to convey reliability information. The impli- 
cations of the standard error of measurement 
are especially important in situations where 
decisions cannot be postponed and corroborative
sources of information are limited.

Standard 2.3

When test interpretation emphasizes differ- 
ences between two observed scores of an 
individual or two averages of a group, relia- 
bility data, including standard errors, should 
be provided for such differences. 

Comment: Observed score differences are used 
for a variety of purposes. Achievement gains 
are frequently the subject of inferences for 
groups as well as individuals. Differences 
between verbal and performance scores of 
intelligence and scholastic ability tests are 
often employed in the diagnosis of cognitive 
impairment and learning problems. Psycho- 
diagnostic inferences are frequently drawn 
from the differences between subtest scores. 
Aptitude and achievement batteries, interest 
inventories, and personality assessments are 
commonly used to identify and quantify the 
relative strengths and weaknesses or the pat- 
tern of trait levels of an examinee. When the 
interpretation of test scores centers on the 
peaks and valleys in the examinee’s test score 
profile, the reliability of score differences for 
all pairs of scores is critical. 
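
The reason difference scores deserve separate reliability data can be seen from the classical formulas for the difference between two scores. The sketch below assumes that the measurement errors of the two scores are uncorrelated; the function and parameter names are illustrative assumptions.

```python
import math


def difference_reliability(sd_x, sd_y, rel_x, rel_y, r_xy):
    """Reliability of the difference X - Y under classical test theory.

    sd_x, sd_y: observed score standard deviations
    rel_x, rel_y: reliabilities of the two scores
    r_xy: observed correlation between the two scores
    """
    covariance = r_xy * sd_x * sd_y
    true_variance = sd_x ** 2 * rel_x + sd_y ** 2 * rel_y - 2 * covariance
    observed_variance = sd_x ** 2 + sd_y ** 2 - 2 * covariance
    if observed_variance <= 0:
        raise ValueError("difference scores have no observed variance")
    return true_variance / observed_variance


def difference_standard_error(sd_x, sd_y, rel_x, rel_y):
    """Standard error of measurement of the difference X - Y."""
    return math.sqrt(sd_x ** 2 * (1 - rel_x) + sd_y ** 2 * (1 - rel_y))
```

Two scores that each have a reliability of .90 and a standard deviation of 10 but correlate .70 with one another yield a difference with a reliability of only about .67, which illustrates how a profile of individually reliable scores can still produce unreliable peaks and valleys.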

Standard 2.4 

Each method of quantifying the precision 
or consistency of scores should be described 
clearly and expressed in terms of statistics 
appropriate to the method. The sampling 
procedures used to select examinees for relia- 
bility analyses and descriptive statistics on 
these samples should be reported. 

Comment: Information on the method of 
subject selection, sample sizes, means, stan- 
dard deviations, and demographic characteris- 
tics of the groups helps users judge the extent 
to which reported data apply to their own 
examinee populations. If the test-retest or 
alternate-form approach is used, the interval 
between testings should be indicated. Because 
there are many ways of estimating reliability,
each influenced by different sources of measurement
error, it is unacceptable to say simply,
“The reliability of test X is .90.” A better 
statement would be, “The reliability coeffi- 
cient of .90 reported for scores on test X was 
obtained by correlating scores from forms A 
and B administered on successive days. The 
data were based on a sample of 400 10th-grade
students from five middle-class suburban 
schools in New York State. The demographic 
breakdown of this group was as follows: ....” 

Standard 2.5 

A reliability coefficient or standard error of 
measurement based on one approach should 
not be interpreted as interchangeable with 
another derived by a different technique 
unless their implicit definitions of measure- 
ment error are equivalent. 

Comment: Internal consistency, alternate- 
form, test-retest, and generalizability coeffi- 
cients should not be considered equivalent, as 
each may incorporate a unique definition of 
measurement error. Error variances derived 
via item response theory may not be equiva- 
lent to error variances estimated via other 
approaches. Test developers should indicate 
the sources of error that are reflected in or
ignored by the reported reliability indices. 

Standard 2.6 

If reliability coefficients are adjusted for restric- 
tion of range or variability, the adjustment pro- 
cedure and both the adjusted and unadjusted 
coefficients should be reported. The standard 
deviations of the group actually tested and of 
the target population, as well as the rationale 
for the adjustment, should be presented. 

Comment: Application of a correction for 
restriction in variability presumes that the 
available sample is not representative of the 
test-taker population to which users might be 
expected to generalize. The rationale for the 
correction should consider the appropriateness
of such a generalization. Adjustment formulas
that presume constancy in the standard
error across score levels should not be used 
unless constancy can be defended. 
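
One commonly cited adjustment of this kind holds the error variance fixed and rescales the reliability to the standard deviation of the target population. The sketch below assumes exactly that constancy, which the comment above says must be defensible; the function name and the returned field names are illustrative assumptions.

```python
def adjust_reliability_for_range(reliability, sd_sample, sd_target):
    """Adjust a reliability coefficient for a difference in variability.

    Assumes the error variance estimated in the sample also holds in the
    target population. Returns both coefficients and both standard
    deviations so that all can be reported together.
    """
    if sd_sample <= 0 or sd_target <= 0:
        raise ValueError("standard deviations must be positive")
    error_variance = sd_sample ** 2 * (1.0 - reliability)
    adjusted = 1.0 - error_variance / sd_target ** 2
    return {
        "unadjusted": reliability,
        "adjusted": adjusted,
        "sd_sample": sd_sample,
        "sd_target": sd_target,
    }
```

A coefficient of .80 obtained in a sample with a standard deviation of 8 corresponds to about .87 in a target population with a standard deviation of 10, provided the constancy assumption holds.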

Standard 2.7 

When subsets of items within a test are dic- 
tated by the test specifications and can be 
presumed to measure partially independent 
traits or abilities, reliability estimation pro- 
cedures should recognize the multifactor 
character of the instrument. 

Comment: The total score on a test that is 
clearly multifactor in nature should be treated 
as a composite score. If an internal consistency 
estimate of total score reliability is obtained 
by the split-halves procedure, the halves 
should be parallel in content and statistical 
characteristics. Stratified coefficient alpha 
should be used rather than the more familiar 
nonstratified coefficient. 
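
The difference between the two coefficients can be sketched with a small example. The code below computes ordinary coefficient alpha and a stratified alpha of the usual form, 1 minus the sum over strata of (stratum score variance times 1 minus stratum alpha), divided by total score variance. The function names and the tiny data set are hypothetical, and the strata are assumed to partition the items.

```python
from statistics import variance


def coefficient_alpha(rows):
    """Coefficient alpha for a list of examinee rows of item scores."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


def stratified_alpha(rows, strata):
    """Stratified alpha; strata is a list of lists of item column indices."""
    total_var = variance([sum(row) for row in rows])
    error = 0.0
    for columns in strata:
        sub = [[row[j] for j in columns] for row in rows]
        stratum_var = variance([sum(r) for r in sub])
        error += stratum_var * (1.0 - coefficient_alpha(sub))
    return 1.0 - error / total_var


# Hypothetical two-factor test: items 0-1 form one subset, items 2-3 another.
scores = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
alpha_plain = coefficient_alpha(scores)
alpha_strat = stratified_alpha(scores, [[0, 1], [2, 3]])
```

In this contrived case the items agree perfectly within each subset but the subsets are unrelated, so the nonstratified coefficient (about 0.67) understates the reliability of the composite relative to the stratified coefficient (1.0).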

Standard 2.8 

Test users should be informed about the 
degree to which rate of work may affect 
examinee performance. 

Comment: It is not possible to state, in general, 
whether reliability coefficients will increase or 
decrease when rate of work becomes an impor- 
tant source of systematic variance. Rate of work, 
as an examinee trait, may be more stable or 
less stable from occasion to occasion than the 
other factors the test is designed to measure. 
Because speededness has differential effects on 
various estimates, information on speededness 
is helpful in interpreting reported coefficients. 

The importance of the speed factor can 
sometimes be inferred from analyses of item 
responses and from observations by examiners 
during test administrations conducted for 
reliability analyses. The distribution of “last 
item attempted” and increases in the frequency 
of omitted responses toward the end of a 
test are also highly informative, though not 
conclusive, evidence regarding speededness. A 
decline in the proportion of correct responses, 
beyond that attributable to increasing item 
difficulty, may indicate that some examinees 
were responding randomly. With computer- 
administered tests, abnormally fast item response 
times, particularly toward the end of the test, 
may also suggest that examinees were respond- 
ing randomly. In the case of constructed- 
response exercises, including essay questions, 
the completeness of the responses may sug- 
gest that time constraints had little effect on 
early items but a significant effect on later 
items. Introduction of a speed factor into 
what might otherwise be a power test may 
have a marked effect on alternate-form and 
test-retest reliabilities. A shift from a paper- 
and-pencil format to a computer-adminis- 
tered format may affect test speededness. 

Standard 2.9 

When a test is designed to reflect rate of 
work, reliability should be estimated by the 
alternate-form or test-retest approach, using 
separately timed administrations. 

Comment: Split-half coefficients based on 
separate scores from the odd-numbered and 
even-numbered items are known to yield 
inflated estimates of reliability for highly 
speeded tests. Coefficient alpha and other 
internal consistency coefficients may also be 
biased, though the size of the bias is not as 
clear as that for the split-halves coefficient. 

Standard 2.10 

When subjective judgment enters into test 
scoring, evidence should be provided on both 
inter-rater consistency in scoring and within- 
examinee consistency over repeated measure- 
ments. A clear distinction should be made 
among reliability data based on (a) independent 
panels of raters scoring the same performances 
or products, (b) a single panel scoring 
successive performances or new products, and 
(c) independent panels scoring successive per- 
formances or new products. 

Comment: Task-to-task variations in the quality 
of an examinee’s performance and rater-to-rater 
inconsistencies in scoring represent independent 
sources of measurement error. Reports of 
reliability studies should make clear which of 
these sources are reflected in the data. Where 
feasible, the error variances arising from each 
source should be estimated. Generalizability 
studies and variance component analyses are 
especially helpful in this regard. These analy- 
ses can provide separate error variance esti- 
mates for tasks within examinees, for judges, 
and for occasions within the time period of 
trait stability. Information should be provided 
on the qualifications of the judges used in 
reliability studies. 

Inter-rater or inter-observer agreement 
may be particularly important for ratings and 
observational data that involve subtle discrimi- 
nations. It should be noted, however, that 
when raters evaluate positively correlated 
characteristics, a favorable or unfavorable 
assessment of one trait may color their opin- 
ions of other traits. Moreover, high inter-rater 
consistency does not imply high examinee 
consistency from task to task. Therefore, 
internal consistency within raters and inter- 
rater agreement do not guarantee high relia- 
bility of examinee scores. 

Standard 2.11 

If there are generally accepted theoretical or 
empirical reasons for expecting that reliabili- 
ty coefficients, standard errors of measure- 
ment, or test information functions will 
differ substantially for various subpopula- 
tions, publishers should provide reliability 
data as soon as feasible for each major popu- 
lation for which the test is recommended. 


Comment: If test score interpretation involves 
inferences within subpopulations as well as 
within the general population, reliability data 
should be provided for both the subpopulations 
and the general population. Test users who 
work exclusively with a specific cultural group 
or with individuals who have a particular disability 
would benefit from an estimate of the 
standard error for such a subpopulation. Some 
groups of test takers — pre-school children, for 
example — tend to respond to test stimuli in a 
less consistent fashion than do older children. 

Standard 2.12 

If a test is proposed for use in several grades 
or over a range of chronological age groups 
and if separate norms are provided for each 
grade or each age group, reliability data should 
be provided for each age or grade population, 
not solely for all grades or ages combined. 

Comment: A reliability coefficient based on a 
sample of examinees spanning several grades 
or a broad range of ages in which average 
scores are steadily increasing will generally 
give a spuriously inflated impression of relia- 
bility. When a test is intended to discriminate 
within age or grade populations, reliability 
coefficients and standard errors should be 
reported separately for each population. 

Standard 2.13 

If local scorers are employed to apply gener- 
al scoring rules and principles specified by 
the test developer, local reliability data should 
be gathered and reported by local authorities 
when adequate size samples are available. 

Comment: For example, many statewide test- 
ing programs depend on local scoring of 
essays, constructed-response exercises, and 
performance tests. Reliability analyses bear on 
the possibility that additional training of scor- 
ers is needed and, hence, should be an inte- 
gral part of program monitoring. 


Standard 2.14 

Conditional standard errors of measurement 
should be reported at several score levels if 
constancy cannot be assumed. Where cut scores 
are specified for selection or classification, the 
standard errors of measurement should be 
reported in the vicinity of each cut score. 

Comment: Estimation of conditional standard 
errors is usually feasible even with the sample 
sizes that are typically used for reliability 
analyses. If it is assumed that the standard 
error is constant over a broad range of score 
levels, the rationale for this assumption should 
be presented. 
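
A minimal sketch of conditional standard errors follows. It uses Lord's binomial error model for number-correct scores on dichotomously scored items, SEM(x) = sqrt(x(n - x)/(n - 1)), which is only one of several available methods and is an assumption of this example rather than a requirement of the standard. The function names, test length, and cut score are hypothetical.

```python
import math


def binomial_conditional_sem(raw_score, n_items):
    """Conditional SEM at a raw score under the binomial error model."""
    if n_items < 2 or not 0 <= raw_score <= n_items:
        raise ValueError("need n_items >= 2 and 0 <= raw_score <= n_items")
    return math.sqrt(raw_score * (n_items - raw_score) / (n_items - 1))


def csem_near_cut(cut_score, n_items, width=2):
    """Conditional SEMs for raw scores within `width` points of a cut score."""
    low = max(0, cut_score - width)
    high = min(n_items, cut_score + width)
    return {x: binomial_conditional_sem(x, n_items) for x in range(low, high + 1)}


# Hypothetical 50-item test with a cut score of 45.
table = csem_near_cut(45, 50)
```

Under this model the standard error is largest near the middle of the score range (about 3.57 at a score of 25 on 50 items) and smaller near the extremes (about 2.14 at 45), which shows why a single overall value may misstate precision near a cut score.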

Standard 2.15 

When a test or combination of measures is 
used to make categorical decisions, estimates 
should be provided of the percentage of 
examinees who would be classified in the 
same way on two applications of the proce- 
dure, using the same form or alternate forms 
of the instrument. 

Comment: When a test or composite is used to 
make categorical decisions, such as pass/fail, 
the standard error of measurement at or near 
the cut score has important implications for the 
trustworthiness of these decisions. However, 
the standard error cannot be translated into 
the expected percentage of consistent deci- 
sions unless assumptions are made about the 
form of the distributions of measurement 
errors and true scores. It is preferable that this 
percentage be estimated directly through the 
use of a repeated-measurements approach if 
consistent with the requirements of test secu- 
rity and if adequate samples are available. 
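
The directly estimated percentage described above can be sketched as a simple tally over examinees who took the procedure twice. The function name, the scores, and the cut score are hypothetical; a score at or above the cut is assumed to pass.

```python
def decision_consistency(scores_first, scores_second, cut_score):
    """Proportion of examinees classified the same way on two applications."""
    if len(scores_first) != len(scores_second) or not scores_first:
        raise ValueError("need two non-empty score lists of equal length")
    same = sum(
        (a >= cut_score) == (b >= cut_score)
        for a, b in zip(scores_first, scores_second)
    )
    return same / len(scores_first)


# Hypothetical scores for five examinees on two administrations, cut score 20.
first = [10, 15, 20, 25, 30]
second = [12, 14, 22, 19, 31]
consistency = decision_consistency(first, second, cut_score=20)
```

Here four of the five examinees receive the same pass/fail decision on both occasions, so the estimate is 0.8. A real study would need a much larger sample for the estimate to be stable.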

Standard 2.16 

In some testing situations, the items vary from 
examinee to examinee — through random selec- 
tion from an extensive item pool or application 
of algorithms based on the examinee’s level of 
performance on previous items or preferences 
with respect to item difficulty. In this type of 
testing, the preferred approach to reliability 
estimation is one based on successive adminis- 
trations of the test under conditions similar to 
those prevailing in operational test use. 

Comment: Varying the set of items presented 
to each examinee is an acceptable procedure 
in some settings. If this approach is used, reli- 
ability data should be appropriate to this pro- 
cedure. Estimates of standard errors of ability 
scores can be computed through the use of 
IRT and reported routinely as part of the 
adaptive testing procedure. However, those 
estimates are not an adequate substitute for 
estimates based on successive administrations 
of the adaptive test, nor do they bear on the 
issue of stability over short intervals. IRT esti- 
mates are contingent on the adequacy of both 
the item parameter estimates and the item res- 
ponse models adopted in the theory. Estimates 
of reliabilities and standard errors of measure- 
ment based on the administration and analysis 
of alternate forms of an adaptive test reflect 
errors associated with the entire measurement 
process. The alternate-form estimates provide 
an independent check on the magnitude of 
the errors of measurement specific to the 
adaptive feature of the testing procedure. 
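
The IRT-based standard error mentioned above can be sketched under a two-parameter logistic model, where each item contributes information a^2 * P * (1 - P) and the standard error of the ability estimate is 1/sqrt(total information). The choice of model, the function names, and the item parameters are assumptions made for illustration only.

```python
import math


def item_information_2pl(theta, a, b):
    """Fisher information of one 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def ability_standard_error(theta, items):
    """Model-based SE of ability; items is a list of (a, b) parameter pairs."""
    info = sum(item_information_2pl(theta, a, b) for a, b in items)
    if info <= 0.0:
        raise ValueError("no information at this ability level")
    return 1.0 / math.sqrt(info)


# Hypothetical adaptive test: four items with a = 1.0 and b = 0.0.
se_at_zero = ability_standard_error(0.0, [(1.0, 0.0)] * 4)
```

This value depends entirely on the assumed item parameters and model, which is why, as the comment notes, it does not replace estimates based on successive administrations of the adaptive test.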

Standard 2.17 

When a test is available in both long and short 
versions, reliability data should be reported for 
scores on each version, preferably based on an 
independent administration of each. 

Comment: Some tests and test batteries are 
published in both a “full-length” version and 
a “survey” or “short” version. In many appli- 
cations the Spearman-Brown formula will sat- 
isfactorily approximate the reliability of one of 
these from data based on the other. However, 
context effects are commonplace in tests of 
maximum performance. Also, the short ver- 
sion of a standardized test often comprises a 
nonrandom sample of items from the full- 
length version. Therefore, the shorter version 
may be more reliable or less reliable than the 
Spearman-Brown projections from the full- 
length version. The reliability of scores on 
each version is best evaluated through an 
independent administration of each, using 
the designated time limits. 
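
For reference, the Spearman-Brown projection discussed in this comment can be sketched as follows. The function name and the example coefficients are hypothetical; length_factor is the ratio of the new test length to the original length.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (
        1.0 + (length_factor - 1.0) * reliability
    )


# Hypothetical full-length reliability of 0.90, short form half as long.
projected_short = spearman_brown(0.90, 0.5)
# Hypothetical short-form reliability of 0.60, projected to double length.
projected_long = spearman_brown(0.60, 2.0)
```

The projections here are about 0.82 and 0.75. They assume the added or removed items are parallel to the rest, which is the assumption that context effects and nonrandom item selection can violate.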

Standard 2.18 

When significant variations are permitted in 
test administration procedures, separate reli- 
ability analyses should be provided for scores 
produced under each major variation if ade- 
quate sample sizes are available. 

Comment: To accommodate examinees with 
disabilities, test publishers might authorize 
modifications in the procedures and time 
limits that are specified for the administration 
of the paper-and-pencil edition of a test. In 
some cases, modified editions of the test itself 
may be provided. For example, tape-recorded 
versions for use in a group setting or with 
individual equipment may be used to test 
examinees who exhibit reading disabilities or 
attention deficits. If such modifications can 
be employed with test takers who are not dis- 
abled, insights can be gained regarding the 
possible effects on test scores of these non- 
standard administrations. 

Standard 2.19 

When average test scores for groups are used 
in program evaluations, the groups tested 
should generally be regarded as a sample 
from a larger population, even if all exam- 
inees available at the time of measurement are 
tested. In such cases the standard error of the 
group mean should be reported, as it reflects 
variability due to sampling of examinees as 
well as variability due to measurement error. 


Comment: The graduating seniors of a liberal 
arts college, the current clients of a social 
service agency, and analogous groups exposed 
to a program of interest typically constitute a 
sample in a longitudinal sense. Presumably, 
comparable groups from the same population 
will recur in future years, given static condi- 
tions. The factors leading to uncertainty in 
conclusions about program effectiveness arise 
from the sampling of persons as well as meas- 
urement error. Therefore, the standard error 
of the mean observed score, reflecting varia- 
tion in both true scores and measurement 
errors, represents a more realistic standard 
error in this setting. Even this value may 
underestimate the variability of group means 
over time. In many settings, the static condi- 
tions assumed under random sampling of 
persons do not prevail. 
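
The contrast drawn in this comment can be sketched numerically. The first function gives the standard error of the mean observed score, which reflects both sampling of persons and measurement error. The second, shown only for comparison, counts measurement error alone. The function names, scores, and reliability value are hypothetical.

```python
import math
from statistics import stdev


def standard_error_of_group_mean(scores):
    """SE of the mean observed score: observed SD divided by sqrt(n)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    return stdev(scores) / math.sqrt(n)


def measurement_only_standard_error(scores, reliability):
    """SE of the mean if only measurement error is counted (for contrast)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    sem = stdev(scores) * math.sqrt(1.0 - reliability)
    return sem / math.sqrt(n)


# Hypothetical group of five examinees and an assumed reliability of 0.90.
group = [70, 80, 90, 100, 110]
se_full = standard_error_of_group_mean(group)
se_measurement = measurement_only_standard_error(group, 0.90)
```

With these assumed values the full standard error is about 7.07 and the measurement-only figure is about 2.24, so treating the tested group as the whole population would considerably understate the uncertainty in the group mean.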

Standard 2.20 

When the purpose of testing is to measure the 
performance of groups rather than individuals, 
a procedure frequently used is to assign a small 
subset of items to each of many subsamples of 
examinees. Data are aggregated across sub- 
samples and item subsets to obtain a measure 
of group performance. When such procedures 
are used for program evaluation or population 
descriptions, reliability analyses must take the 
sampling scheme into account. 

Comment: This type of measurement program 
is termed matrix sampling. It is designed to 
reduce the time demanded of individual 
examinees and to increase the total number of 
items on which data are obtained. This test- 
ing approach provides the same type of infor- 
mation about group performances that would 
accrue if all examinees could respond to all 
exercises in the item pool. Reliability statistics 
must be appropriate to the sampling plan 
used with respect to examinees and items. 

Background 

Test development is the process of producing 
a measure of some aspect of an individual’s 
knowledge, skill, ability, interests, attitudes, 
or other characteristics by developing items 
and combining them to form a test, accord- 
ing to a specified plan. Test development is 
guided by the stated purpose(s) of the test 
and the intended inferences to be made from 
the test scores. The test development process 
involves consideration of content, format, the 
context in which the test will be used, and 
the potential consequences of using the test. 
Test development also includes specifying 
conditions for administering the test, deter- 
mining procedures for scoring the test per- 
formance, and reporting the scores to test 
takers and test users. This chapter focuses pri- 
marily on the following aspects of test devel- 
opment: stating the purpose(s) of the test, 
defining a framework for the test, developing 
test specifications, developing and evaluating 
items and their associated scoring procedures, 
assembling the test, and revising the test. The 
first section describes the test development 
process that begins with a statement of the 
purpose(s) of the test and culminates with 
the assembly of the test. The second section 
addresses several special considerations in test 
development, including considerations in 
delineating the test framework and in devel- 
oping performance assessments. The chapter 
concludes with a discussion on test revision. 
Issues bearing on validity, reliability, and fair- 
ness are interwoven within the stages of test 
development. Each of these topics is addressed 
comprehensively in other chapters of the 
Standards: validity in chapter 1, reliability in 
chapter 2, and aspects of fairness in chapters 
7, 8, 9, and 10. Additional material on test 
administration and scoring, and on reporting 
scores and results, is provided in chapter 5. 
Chapter 4 discusses score scales, and the focus 
of chapter 6 is test documents. 


Test Development 

The process of developing educational and psy- 
chological tests commonly begins with a state- 
ment of the purpose(s) of the test and the 
construct or content domain to be measured. 
Tests of the same construct or domain can dif- 
fer in important ways, because a number of 
decisions must be made as the test is developed. 
It is helpful to consider the four phases leading 
from the original statement of purpose(s) to the 
final product: (a) delineation of the purpose(s) 
of the test and the scope of the construct or the 
extent of the domain to be measured; (b) devel- 
opment and evaluation of the test specifica- 
tions; (c) development, field testing, evaluation, 
and selection of the items and scoring guides 
and procedures; and (d) assembly and evalua- 
tion of the test for operational use. What fol- 
lows is a description of typical test development 
procedures, though there may be sound reasons 
that some of these steps are followed in some 
settings and not in others. 

The first step is to extend the original 
statement of purpose(s), and the construct or 
content domain being considered, into a frame- 
work for the test that describes the extent of 
the domain, or the scope of the construct to 
be measured. The test framework, therefore, 
delineates the aspects (e.g., content, skills, 
processes, and diagnostic features) of the con- 
struct or domain to be measured. For example, 
“Does eighth-grade mathematics include 
algebra?” “Does verbal ability include text 
comprehension as well as vocabulary?” “Does 
self-esteem include both feelings and acts?” 
The delineation of the test framework can be 
guided by theory or an analysis of the content 
domain or job requirements as in the case of 
many licensing and employment tests. The test 
framework serves as a guide to subsequent test 
evaluation. The chapter on validity provides a 
more thorough discussion of the relationships 
among the construct or content domain, the 
test framework, and the purpose(s) of the test. 

Once decisions have been made about 
what the test is to measure, and what its scores 
are intended to convey, the next step is to 
design the test by establishing test specifica- 
tions. The test specifications delineate the for- 
mat of items, tasks, or questions; the response 
format or conditions for responding; and the 
type of scoring procedures. The specifications 
may indicate the desired psychometric prop- 
erties of items, such as difficulty and discrimi- 
nation, as well as the desired test properties 
such as test difficulty, inter-item correlations, 
and reliability. The test specifications may 
also include such factors as time restrictions, 
characteristics of the intended population of 
test takers, and procedures for administration. 
All subsequent test development activities are 
guided by the test specifications. 

Test specifications will include, at least 
implicitly, an indication of whether the test 
scores will be primarily norm-referenced or 
criterion-referenced. When scores are norm- 
referenced, relative score interpretations are of 
primary interest. A score for an individual or 
for a definable group is ranked within one or 
more distributions of scores or compared to 
the average performance of test takers for var- 
ious reference populations (e.g., based on age, 
grade, diagnostic category, or job classifica- 
tion). When scores are criterion-referenced, 
absolute score interpretations are of primary 
interest. The meaning of such scores does not 
depend on rank information. Rather, the test 
score conveys directly a level of competence 
in some defined criterion domain. Both rela- 
tive and absolute interpretations are often 
used with a given test, but the test developer 
determines which approach is most relevant 
for that test. 
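
The two kinds of interpretation can be contrasted in a short sketch. The same score is interpreted once relative to a norm group, as a percentile rank, and once against a fixed standard, as a mastery decision. The function names, norm-group scores, and cut score are hypothetical, and the mid-rank convention for ties is one of several in use.

```python
def percentile_rank(score, norm_scores):
    """Relative interpretation: percent of the norm group scoring below,
    counting half of those with the same score."""
    if not norm_scores:
        raise ValueError("norm group must not be empty")
    below = sum(1 for s in norm_scores if s < score)
    equal = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)


def meets_criterion(score, cut_score):
    """Absolute interpretation: does the score reach the fixed standard?"""
    return score >= cut_score


# Hypothetical norm group and cut score.
norm_group = [10, 20, 30, 40, 50]
rank = percentile_rank(30, norm_group)
mastery = meets_criterion(30, cut_score=35)
```

A score of 30 sits at the 50th percentile of this norm group yet falls short of the assumed cut score of 35, so the relative and absolute interpretations of one score can lead to different conclusions.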

The nature of the item and response for- 
mats that may be specified depends on the 
purposes of the test and the defined domain 
of the test. Selected-response formats, such as 
multiple-choice items, are suitable for many 
purposes of testing. The test specifications 
indicate how many alternatives are to be used 
for each item. Other purposes may be more 
effectively served by a short constructed-response 
format. Short-answer items require a response 
of no more than a few words. Extended-response 
formats require the test taker to write a more 
extensive response of one or more sentences 
or paragraphs. Performance assessments often 
seek to emulate the context or conditions in 
which the intended knowledge or skills are 
actually applied. One type of performance 
assessment, for example, is the standardized 
job or work sample. A task is presented to the 
test taker in a standardized format under 
standardized conditions. Job or work samples 
might include, for example, the assessment of 
a practitioner’s ability to make an accurate diagnosis 
and recommend treatment for a defined 
condition, a manager’s ability to articulate goals 
for an organization, or a student’s proficiency 
in performing a science laboratory experiment. 

All types of items require some indica- 
tion of how to score the responses. For select- 
ed-response items, one alternative is considered 
the correct response in some testing programs. 
In other testing programs, the alternatives may 
be weighted differentially. For short-answer 
items, a list of acceptable alternatives may 
suffice; extended-response items need more 
detailed rules for scoring, sometimes called 
scoring rubrics. Scoring rubrics specify the crite- 
ria for evaluating performance and may vary in 
the degree of judgment entailed, in the number 
of score levels, and in other ways. It is com- 
mon practice for test developers to provide 
scorers with examples of performances at each 
of the score levels to help clarify the criteria. 

For extended-response items, including 
performance tasks, two major types of scoring 
procedures are used: analytic and holistic. Both 
of the procedures require explicit performance 
criteria that reflect the test framework. However, 
the approaches differ in the degree of detail 
provided in the evaluation report. Under the 
analytic scoring procedure, each critical 
dimension of the performance criteria is judged 
independently, and separate scores are obtained 
for each of these dimensions in addition to 
an overall score. Under the holistic scoring 
procedure, the same performance criteria may 
implicitly be considered, but only one overall 
score is provided. Because the analytic proce- 
dure provides information on a number of 
critical dimensions, it potentially provides valu- 
able information for diagnostic purposes and 
lends itself to evaluating strengths and weak- 
nesses of test takers. In contrast, the holistic 
procedure may be preferable when an overall 
judgment is desired and when the skills being 
assessed are complex and highly interrelated. 
Regardless of the type of scoring procedure, 
designing the items and developing the scoring 
rubrics and procedures is an integrated process. 

A participatory approach may be used in 
the design of items, scoring rubrics, and some- 
times the scoring process itself. Many interested 
persons (e.g., practitioners, teachers) may be 
involved in developing items and scoring rubrics, 
and/or evaluating the subsequent performan- 
ces. If a participatory approach is used, partici- 
pants’ knowledge about the domain being 
assessed and their ability to apply the scoring 
rubrics are of critical importance. Equally 
important, for those involved in developing 
tests and evaluating performances, is their 
familiarity with the nature of the population 
being tested. Relevant characteristics of the 
population being tested may include the typi- 
cal range of expected skill levels, their famil- 
iarity with the response modes required of 
them, and the primary language they use. 

The test developer usually assembles an 
item pool that consists of a larger set of items 
than what is required by the test specifications. 
This allows for the test developer to select 
a set of items for the test that meet the test 
specifications. The quality of the items is 
usually ascertained through item review pro- 
cedures and pilot testing. Items are reviewed 
for content quality, clarity and lack of ambi- 
guity. Items sometimes are reviewed for sensi- 
tivity to gender or cultural issues. An attempt 
is generally made to avoid words and topics 
that may offend or otherwise disturb some 
test takers, if less offensive material is equally 
useful. Often, a field test is developed and 
administered to a group of test takers who are 
somewhat representative of the target popula- 
tion for the test. The field test helps deter- 
mine some of the psychometric properties of 
the test items, such as an item’s difficulty and 
ability to discriminate among test takers of 
different standing on the scale. Ongoing test- 
ing programs often pretest items by inserting 
them into existing tests. Those items are not 
used in obtaining test scores of the test takers, 
but the item responses provide useful data for 
test development. 
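
The two item statistics named above can be sketched from field-test data. Difficulty is taken here as the proportion answering correctly, and discrimination as the correlation between the item score and the score on the remaining items. Both are common classical indices, but they are assumptions of this example, as are the function names and the small response matrix.

```python
import math


def item_difficulty(rows, item):
    """Proportion of test takers answering the item correctly (0/1 scoring)."""
    return sum(row[item] for row in rows) / len(rows)


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined without variance")
    return sxy / math.sqrt(sxx * syy)


def item_discrimination(rows, item):
    """Correlation of the item with the total score on all other items."""
    item_scores = [row[item] for row in rows]
    rest_scores = [sum(row) - row[item] for row in rows]
    return pearson(item_scores, rest_scores)


# Hypothetical field-test responses: four test takers, three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
difficulty_first = item_difficulty(responses, 0)
discrimination_second = item_discrimination(responses, 1)
```

In this toy data set the first item is answered correctly by 75 percent of test takers, and the second item correlates about 0.71 with the score on the remaining items. Operational field tests rely on far larger samples.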

The next step in test development is to 
assemble items into a test or to identify an 
item pool for an adaptive test. The test devel- 
oper is responsible for ensuring that the items 
selected for the test meet the requirements of 
the test specifications. Depending upon the 
purpose(s) of the test, relevant considerations 
in item selection may include the content 
quality and scope, the weighting of items and 
subdomains, and the appropriateness of the 
items selected for the intended population of 
test takers. Often test developers will specify 
the distribution of psychometric indices of 
the items to be included in the test. For 
example, the specified distribution of item 
difficulty indices for a selection test would 
differ from the distribution specified for a 
general achievement test. When psychometric 
indices of the items are estimated using item 
response theory (IRT), the fit of the model 
to the data is also evaluated. This is accom- 
plished by evaluating the extent to which the 
assumptions underlying the item response 
model (e.g., unidimensionality, local inde- 
pendence, speededness, and equality of slope 
parameters) are satisfied. 

The test developer is also responsible for 
ensuring that the scoring procedures are consistent 
with the purpose(s) of the test and 
facilitate meaningful score interpretation. The 
nature of the intended score interpretations 
will determine the importance of psychometric 
characteristics of items in the test construction 
process. For example, indices of item difficulty 
and discrimination, and inter-item correlations, 
may be particularly important when relative 
score interpretations are intended. In the case 
of relative score interpretations, good discrim- 
ination among test takers at all points along 
the construct continuum is desirable. It is 
important, however, that the test specifica- 
tions are not compromised when optimizing 
the distribution of these indices. In the case 
of absolute score interpretations, different cri- 
teria apply. In this case, the extent to which 
the relevant domain has been adequately rep- 
resented is important even if many of the 
items are relatively easy or nondiscriminating 
within a relevant population. It is important, 
however, to assure the quality of the content 
of relatively easy or nondiscriminating items. 
If cut scores are necessary for score interpreta- 
tion in criterion-referenced programs, the level 
of item discrimination constitutes critical 
information primarily in the vicinity of the 
cut scores. Because of these differences in test 
development procedures, tests designed to 
facilitate one type of interpretation function 
less effectively for other types of interpretation. 
Given appropriate test design and supporting 
evidence, however, scores arising from some 
norm-referenced programs may provide rea- 
sonable absolute score interpretations and 
scores arising from some criterion-refer- 
enced programs may provide reasonable rela- 
tive score interpretations. 

When evaluating the quality of the items 
in the item pool and the test itself, test devel- 
opers often conduct studies of differential 
item functioning (see chapter 7). Differential 
item functioning is said to exist when test 
takers of approximately equal ability on the 
targeted construct or content domain differ 
in their responses to an item according to their 
group membership. In theory, the ultimate 
goal of such studies is to identify construct- 
irrelevant aspects of item content, item format, 
or scoring criteria that may differentially affect 
test scores of one or more groups of test tak- 
ers. When differential item functioning is 
detected, test developers try to identify plausi- 
ble explanations for the differences, and then 
they may replace or revise items that give rise 
to group differences if construct irrelevance is 
deemed likely. However, at this time, there has 
been little progress in discerning the cause or 
substantive themes that account for differen- 
tial item functioning on a group basis. Items 
for which the differential item functioning 
index is significant may constitute valid meas- 
ures of an element of the intended domain and 
differ in no way from other items that show 
nonsignificant indexes. When the differential 
item functioning index is significant, the test 
developer must take care that any replacement 
items or item revisions do not compromise 
the test specifications. 
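
The text does not name a particular index of differential item functioning. One widely used choice, assumed here purely for illustration, is the Mantel-Haenszel common odds ratio, computed after test takers are matched on total score. The function names, the delta-scale constant of -2.35, and the counts are part of this illustrative sketch.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across matched score levels.

    Each stratum is (ref_correct, ref_incorrect, focal_correct,
    focal_incorrect) for one level of the matching score.
    """
    numerator = denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


def mh_delta(odds_ratio):
    """Odds ratio on the delta scale; 0 indicates no DIF."""
    return -2.35 * math.log(odds_ratio)


# Hypothetical counts at two matched score levels for one item.
counts = [(30, 10, 20, 20), (40, 10, 30, 20)]
alpha_mh = mantel_haenszel_odds_ratio(counts)
delta = mh_delta(alpha_mh)
```

An odds ratio near 1 (delta near 0) suggests that matched groups perform alike on the item. The value of about 2.8 for these assumed counts would favor the reference group, but, as the paragraph above cautions, a significant index alone does not establish that the item is construct-irrelevant.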

When multiple forms of a test are pre- 
pared, the test specifications govern each of 
the forms. Also, when an item pool is devel- 
oped for a computerized adaptive test, the 
specifications refer both to the item pool and 
to the rules or procedures by which the indi- 
vidual item sets are created for each test taker. 
Some of the attractive features of computer- 
ized adaptive tests, such as tailoring the diffi- 
culty level of the items to the test taker’s 
ability, place additional constraints on the 
design of such tests. In general, a large num- 
ber of items is needed for a computerized 
adaptive test to ensure that each tailored item 
set meets the requirements of the test specifi- 
cations. Further, tests often are developed in 
the context of larger systems or programs. 
Multiple item sets, for example, may be creat- 
ed for use with different groups of test takers 
or on different testing dates. Last, when a 
short form of a test is prepared, the test speci- 
fications of the original test govern the short 
form. Differences in the test specifications 
and the psychometric properties of the short 
form and the original test will affect the inter- 
pretation of the scores derived from the short 
form. In any of these cases, the same fundamental 
methods and principles of test development 
apply. 

Special Considerations in Test 
Development 

This section elaborates on several topics dis- 
cussed above. First, considerations in delin- 
eating the framework for the test are discussed. 
Following this, considerations in the develop- 
ment of performance assessments and portfolios 
are addressed. 

Delineating the Framework for
the Test

The scenario presented above outlines what is 
often done to develop a test. However, the activ- 
ities do not always happen in a rigid sequence. 
There is often a subtle interplay between the 
process of conceptualizing a construct or con- 
tent domain and the development of a test of 
that construct or domain. The framework for 
the test provides a description of how the 
construct or domain will be represented. The 
procedures used to develop items and scoring 
rubrics and to examine item characteristics 
may often contribute to clarifying the frame- 
work. The extent to which the framework is 
defined a priori is dependent on the testing 
application. In many testing applications, a 
well-defined framework and detailed test speci- 
fications guide the development of items and 
their associated scoring rubrics and procedures. 
In some areas of psychological measurement, 
test development may be less dependent on 
an a priori defined framework and may rely 
more on a data-based approach that results in 
an empirically derived definition of the frame- 
work. In such instances, items are selected 
primarily on the basis of their empirical rela- 
tionship with an external criterion, their rela- 
tionships with one another, or their power to 
discriminate among groups of individuals. For 
example, construction of a selection test for 
sales personnel might be guided by the
correlations of item scores with productivity
measures of current sales personnel, or a measure of
client satisfaction might be assembled from those
items in an item pool that correlate most highly
with customer loyalty. Similarly, an inventory
to help identify different patterns of psychopa- 
thology might be developed using patients from 
different diagnostic subgroups. When test 
development relies on a data-based approach, 
it is likely that some items will be selected based 
on chance occurrences in the data. Cross-validation
studies, which involve administering the test to a
comparable sample, are routinely conducted to
determine the tendency to select items by chance.

In many testing applications, the frame- 
work for the test is specified initially and this 
specification subsequently guides the develop- 
ment of items and scoring procedures. Empirical 
relationships may then be used to inform 
decisions about retaining, rejecting, or modi- 
fying items. Interpretations of scores from tests 
developed by this process have the advantage 
of a logical/theoretical and an empirical foun- 
dation for the underlying dimensions repre- 
sented by the test. 

Performance Assessments 

One distinction between performance 
assessments and other forms of tests has to do 
with the type of response that is required from 
the test takers. Performance assessments require 
the test takers to carry out a process such as 
playing a musical instrument or tuning a car’s
engine, or to produce a product such as a written
essay. Performance assessments generally
require the test takers to demonstrate their 
abilities or skills in settings that closely resem- 
ble real-life settings. For example, an assess- 
ment of a psychologist in training may require 
the test taker to interview a client, choose 
appropriate tests, and arrive at a diagnosis and
plan for therapy. Performance assessments are 
diverse in nature and can be product-based as 
well as behavior-based. Because performance 
assessments typically consist of a small number
of tasks, establishing the extent to which
the results can be generalized to the broader 
domain is particularly important. The use of 
test specifications will contribute to tasks being 
developed so as to systematically represent the 
critical dimensions to be assessed, leading to a 
more comprehensive coverage of the domain 
than what would occur if test specifications were 
not used. Further, both logical and empirical 
evidence are important to document the extent 
to which performance assessments — tasks as 
well as scoring criteria — reflect the processes 
or skills that are specified by the domain 
definition. When tasks are designed to elicit 
complex cognitive processes, logical analyses 
of the tasks and both logical and empirical 
analyses of the test takers’ performances on 
the tasks provide necessary validity evidence. 

Portfolios 

A unique type of performance assessment is an 
individual portfolio. Portfolios are systematic 
collections of work or educational products 
typically collected over time. Like other assess- 
ment procedures, the design of portfolios is 
dependent on the purpose. Typical purposes 
include judgment of the improvement in job 
or educational performance and evaluation of 
the eligibility for employment, promotion, or 
graduation. A well-designed portfolio specifies 
the nature of the work that is to be put into the 
portfolio. The portfolio may include entries such 
as representative products, the best work of the 
test taker, or indicators of progress. For example, 
in an employment setting involving promotion, 
employees may be instructed to include their 
best work or products. Alternatively, if the pur- 
pose is to judge a student’s educational growth, 
students may be asked to provide evidence of 
improvement with respect to particular com- 
petencies or skills. They may also be requested 
to provide justifications for the choices. Still other 
methods may include the use of videotapes, exhi- 
bitions, demonstrations, simulations, and so on. 

In employment settings, employees may be 
involved in the selection of their work and products
that demonstrate their competencies for
promotion purposes. Analogously, in educational
applications, students may participate in
the selection of some of their work and the prod- 
ucts to be included in their portfolios as well as 
in the evaluation of the materials. The specifi- 
cations for the portfolio indicate who is respon- 
sible for selecting its contents. For example, the 
specifications may state that the test taker, the 
examiner, or both parties working together should 
be involved in the selection of the contents of the
portfolio. The particular responsibilities of each 
party are delineated in the specifications. The 
more standardized the contents and procedures 
of administration, the easier it is to establish 
comparability of portfolio-based scores. 
Regardless of the methods used, all performance 
assessments are evaluated by the same standards 
of technical quality as other forms of tests. 

Test Revisions 

Tests and their supporting documents (e.g., test 
manuals, technical manuals, user’s guides) are
reviewed periodically to determine whether 
revisions are needed. Revisions or amendments 
are necessary when new research data, significant 
changes in the domain, or new conditions of 
test use and interpretation would either improve 
the validity of interpretations of the test scores 
or suggest that the test is no longer fully appro- 
priate for its intended use. As an example, tests 
are revised if the test content or language has be- 
come outdated and, therefore, may subsequently 
affect the validity of the test score interpretations. 
Revisions to test content are also made to ensure 
the confidentiality of the test. It should be noted, 
however, that outdated norms may not have the 
same implications for revisions as an outdated test.
For example, it may be necessary to update the 
norms for an achievement test after a period of 
rising or falling achievement in the norming 
population, or when there are changes in the 
test-taking population, but the test content 
itself may continue to be as relevant as it was 
when the test was developed.

STANDARDS


Standard 3.1 

Tests and testing programs should be devel- 
oped on a sound scientific basis. Test devel- 
opers and publishers should compile and 
document adequate evidence bearing on 
test development. 

Standard 3.2 

The purpose(s) of the test, definition of the 
domain, and the test specifications should 
be stated clearly so that judgments can be 
made about the appropriateness of the 
defined domain for the stated purpose(s) 
of the test and about the relation of items 
to the dimensions of the domain they are 
intended to represent. 

Comment: The adequacy and usefulness of 
test interpretations depend on the rigor with 
which the purposes of the test and the domain 
represented by the test have been defined and 
explicated. The domain definition should be 
sufficiently detailed and delimited to show 
clearly what dimensions of knowledge, skill, 
processes, attitude, values, emotions, or 
behavior are included and what dimensions 
are excluded. A clear description will enhance 
accurate judgments by reviewers and others 
about the congruence of the defined domain 
and the test items. 

Standard 3.3 

The test specifications should be document- 
ed, along with their rationale and the 
process by which they were developed. The 
test specifications should define the content 
of the test, the proposed number of items, 
the item formats, the desired psychometric 
properties of the items, and the item and 
section arrangement. They should also speci- 
fy the amount of time for testing, directions 
to the test takers, procedures to be used for 
test administration and scoring, and other 
relevant information. 


Comment: Professional judgment plays a major
role in developing the test specifications. The 
specific procedures used for developing the 
specifications depend on the purposes of the 
test. For example, in developing licensure and 
certification tests, practice analyses or job analy- 
ses usually provide the basis for defining the 
test specifications, and job analyses primarily 
serve this function for employment tests. For 
achievement tests to be given at the end of a 
course, the test specifications should be based 
on an outline of course content and goals. 
For placement tests, by contrast, it may be
necessary to examine the required entry
knowledge and skills for several courses.

Standard 3.4 

The procedures used to interpret test scores, 
and, when appropriate, the normative or 
standardization samples or the criterion used 
should be documented. 

Comment: Test specifications may indicate that
the intended score interpretations are absolute,
relative, or both. In relative score
interpretations, the status of an individual
(or group) is determined by comparing
the score (or mean score) to the performance of 
others in one or more defined populations. In 
absolute score interpretations, the score or average
is assumed to reflect directly a level of
competence or mastery in some defined criterion
domain. Tests designed to facilitate one type of 
interpretation function less effectively for other 
types of interpretations. Given appropriate test 
design and adequate supporting data, however, 
scores arising from norm-referenced testing pro- 
grams may provide reasonable absolute score 
interpretations and scores arising from criterion- 
referenced programs may provide reasonable 
relative score interpretations. 

Standard 3.5 

When appropriate, relevant experts external 
to the testing program should review the test 
specifications. The purpose of the review, the
process by which the review is conducted,
and the results of the review should be docu- 
mented. The qualifications, relevant experi- 
ences, and demographic characteristics of 
expert judges should also be documented. 

Comment: Expert review of the test specifica- 
tions may serve many useful purposes such as 
helping to assure content quality and repre- 
sentativeness. The expert judges may include 
individuals representing defined populations 
of concern to the test specifications. For exam- 
ple, if the test is related to ethnic minority 
concerns, the expert review typically includes 
members of appropriate ethnic minority 
groups or experts on minority group issues. 

Standard 3.6 

The type of items, the response formats, scor- 
ing procedures, and test administration proce- 
dures should be selected based on the purposes 
of the test, the domain to be measured, and 
the intended test takers. To the extent possible, 
test content should be chosen to ensure that 
intended inferences from test scores are equally 
valid for members of different groups of test 
takers. The test review process should include 
empirical analyses and, when appropriate, the 
use of expert judges to review items and 
response formats. The qualifications, relevant 
experiences, and demographic characteristics 
of expert judges should also be documented. 

Comment: Expert judges may be asked to iden- 
tify material likely to be inappropriate, confus- 
ing, or offensive for groups in the test-taking 
population. For example, judges may be asked 
to identify whether lack of exposure to problem 
contexts in mathematics word problems may 
be of concern for some groups of students. 
Various groups of test takers can be defined by 
characteristics such as age, ethnicity, culture, 
gender, disability, or demographic region. 
There is limited evidence, however, that expert 
reviews alleviate problems with bias in testing 
(see chapter 7). 

Standard 3.7


The procedures used to develop, review, and 
try out items, and to select items from the 
item pool should be documented. If the 
items were classified into different categories 
or subtests according to the test specifica- 
tions, the procedures used for the classifica- 
tion and the appropriateness and accuracy 
of the classification should be documented. 

Comment: Empirical evidence and/or expert 
judgment are used to classify items according 
to categories of the test specifications. For 
example, professional panels may be used for 
classifying the items or for determining the 
appropriateness of the developer’s classifica- 
tion scheme. The panel and procedures used 
should be chosen with care as they will affect 
the accuracy of the classification.

Standard 3.8


When item tryouts or field tests are con- 
ducted, the procedures used to select the 
sample(s) of test takers for item tryouts and 
the resulting characteristics of the sample(s) 
should be documented. When appropriate, 
the sample(s) should be as representative as 
possible of the population(s) for which the 
test is intended. 

Comment: Conditions which may differential- 
ly affect performance on the test items by the 
sample(s) as compared to the intended popu- 
lation(s) should be documented when appro- 
priate. As an example, test takers may be less 
motivated when they know their scores will 
not have an impact on them.

Standard 3.9


When a test developer evaluates the psycho- 
metric properties of items, the classical or 
item response theory (IRT) model used for 
evaluating the psychometric properties of 
items should be documented. The sample used
for estimating item properties should be described
and should be of adequate size and diversity
for the procedure. The process by which
items are selected and the data used for item 
selection, such as item difficulty, item discrimi- 
nation, and/or item information, should also 
be documented. When IRT is used to estimate 
item parameters in test development, the item 
response model, estimation procedures, and 
evidence of model fit should be documented. 

Comment: Although overall sample size is 
important, it is important also that there be an 
adequate number of cases in regions critical to 
the determination of the psychometric proper- 
ties of items. If the test is to achieve greatest 
precision in a particular part of the score scale 
and this consideration affects item selection, 
the manner in which item statistics are used 
needs to be carefully described. When IRT is 
used as the basis of test development, it is 
important to document the adequacy of fit of 
the model to the data. This is accomplished by 
providing information about the extent to 
which IRT assumptions (e.g., unidimensionali- 
ty, local item independence, or equality of slope 
parameters) are satisfied. 
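
As an illustration of the classical item statistics
named in this standard, the sketch below computes item
difficulty as the proportion of test takers answering
correctly and item discrimination as the correlation
between the item score and the total score on the
remaining items. The function names and the response
data are hypothetical; the Standards do not prescribe
any particular index or computation.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when either variable is constant
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def item_statistics(responses):
    """responses: one row per test taker, each a list of 0/1 item scores.

    Returns one (difficulty, discrimination) pair per item.
    """
    n_items = len(responses[0])
    stats = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        # Total score excluding item j, so the item is not correlated with itself.
        rest = [sum(row) - row[j] for row in responses]
        stats.append((mean(item), pearson(item, rest)))
    return stats


# Hypothetical responses from five test takers to three items.
data = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
stats = item_statistics(data)
for j, (p, r) in enumerate(stats, start=1):
    print(f"item {j}: difficulty={p:.2f}, discrimination={r:.2f}")
```

With so few cases these values would be far too
unstable for operational use, which is the point of the
comment above about adequate sample size.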

Test developers should show that any dif- 
ferences between the administration conditions 
of the field test and the final form do not affect 
item performance. Conditions that can affect 
item statistics include item position, time 
limits, length of test, mode of testing (e.g., 
paper-and-pencil versus computer-administered), 
and use of calculators or other tools. For exam- 
ple, in field testing items, those placed at the 
end of a test might obtain poorer item statis- 
tics than those inserted within the test. 

Standard 3.10 

Test developers should conduct cross-valida- 
tion studies when items are selected primari- 
ly on the basis of empirical relationships 
rather than on the basis of content or theoreti- 
cal considerations. The extent to which the dif- 
ferent studies identify the same item set should 
be documented. 


Comment: When data-based approaches to test 
development are used, items are selected prima- 
rily on the basis of their empirical relationships 
with an external criterion, their relationships 
with one another, or their power to discrimi- 
nate among groups of individuals. Under these 
circumstances, it is likely that some items will 
be selected based on chance occurrences in the 
data used. Administering the test to a compara- 
ble sample of test takers or a hold-out sample 
provides a means by which the tendency to 
select items by chance can be determined. 
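
The hold-out check described in this comment can be
sketched as follows. The example simulates hypothetical
data in which only the first few items are truly related
to the criterion, selects items by their correlation
with the criterion in a derivation sample, and then
repeats the selection in a comparable sample. The
sample sizes, the threshold of 0.25, and all function
names are assumptions made for illustration only.

```python
import random
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when either variable is constant
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def select_items(items, criterion, threshold):
    """Indices of items whose correlation with the criterion meets the threshold."""
    return {
        j
        for j in range(len(items[0]))
        if pearson([row[j] for row in items], criterion) >= threshold
    }


def simulate(rng, n_people, n_items, n_valid):
    """Hypothetical data: only the first n_valid items relate to the criterion."""
    items, criterion = [], []
    for _ in range(n_people):
        ability = rng.gauss(0, 1)
        row = []
        for j in range(n_items):
            signal = ability if j < n_valid else 0.0
            row.append(1 if signal + rng.gauss(0, 1) > 0 else 0)
        items.append(row)
        criterion.append(ability + rng.gauss(0, 1))
    return items, criterion


rng = random.Random(0)  # fixed seed so the sketch is repeatable
dev_items, dev_crit = simulate(rng, 60, 30, 5)
hold_items, hold_crit = simulate(rng, 60, 30, 5)

selected = select_items(dev_items, dev_crit, 0.25)
replicated = selected & select_items(hold_items, hold_crit, 0.25)
print("selected in derivation sample:", sorted(selected))
print("also selected in hold-out sample:", sorted(replicated))
```

Items that appear in the first list but not the second
are candidates for having been selected by chance. The
overlap between the two sets is one way to express the
extent to which different studies identify the same
item set.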

Standard 3.11 

Test developers should document the extent to 
which the content domain of a test represents 
the defined domain and test specifications. 

Comment: Test developers should provide evi- 
dence of the extent to which the test items and 
scoring criteria represent the defined domain. This 
affords a basis to help determine whether per- 
formance on the test can be generalized to the 
domain that is being assessed. This is especially 
important for tests that contain a small number 
of items such as performance assessments. Such 
evidence may be provided by expert judges. 

Standard 3.12 

The rationale and supporting evidence for 
computerized adaptive tests should be docu- 
mented. This documentation should include 
procedures used in selecting subsets of items 
for administration, in determining the start- 
ing point and termination conditions for the 
test, in scoring the test, and for controlling 
item exposure. 

Comment: It is important to assure that docu- 
mentation of the procedures does not com- 
promise the security of the test items. 

If a computerized adaptive test is intended 
to measure a number of different content sub- 
categories, item selection procedures are to assure 
that the subcategories are adequately represented 
by the items presented to the test taker.

Standard 3.13

When a test score is derived from the differen- 
tial weighting of items, the test developer 
should document the rationale and process used 
to develop, review, and assign item weights. 
When the item weights are obtained based on 
empirical data, the sample used for obtaining 
item weights should be sufficiently large and 
representative of the population for which the 
test is intended. When the item weights are 
obtained based on expert judgment, the quali- 
fications of the judges should be documented. 

Comment: Changes in the population of test 
takers, along with other changes such as changes 
in instructions, training, or job requirements, 
may impact the original derived item weights, 
necessitating subsequent studies after an 
appropriate period of time. 

Standard 3.14 

The criteria used for scoring test takers’ per- 
formance on extended-response items should be 
documented. This documentation is especially 
important for performance assessments, such as 
scorable portfolios and essays, where the criteria 
for scoring may not be obvious to the user. 

Comment: The completeness and clarity of the 
test specifications, including the definition of the
domain, are essential in developing the scoring
criteria. The test developer needs to provide a
clear description of how the test scores are
intended to be interpreted to help ensure the 
appropriateness of the scoring procedures. 

Standard 3.15 

When using a standardized testing format to 
collect structured behavior samples, the domain, 
test design, test specifications, and materials 
should be documented as for any other test. 
Such documentation should include a clear 
definition of the behavior expected of the test 
takers, the nature of the expected responses, and 
any materials or directions that are necessary 
to carry out the testing.

Comment: In developing a prompt, the age, language,
experience, and ability level of test takers
should be considered, as should other possible
unique sources of difficulty for groups in the
population to be tested. Test directions that specify
time allowances, nature of the responses expect- 
ed, and rules regarding use of supplementary 
materials, such as notes, references, dictionaries, 
calculators, or manipulatives such as lab equip- 
ment, may be established via field testing. 

Standard 3.16 

If a short form of a test is prepared, for exam- 
ple, by reducing the number of items on the 
original test or organizing portions of a test into 
a separate form, the specifications of the short 
form should be as similar as possible to those 
of the original test. The procedures used for 
the reduction of items should be documented. 

Comment: The extent to which the specifica- 
tions of the short form differ from those of 
the original test, and the implications of such 
differences for interpreting the scores derived 
from the short form, should be documented. 

Standard 3.17 

When previous research indicates that irrele- 
vant variance could confound the domain def- 
inition underlying the test, then to the extent 
feasible, the test developer should investigate 
sources of irrelevant variance. Where possible, 
such sources of irrelevant variance should be 
removed or reduced by the test developer.

Standard 3.18 

For tests that have time limits, test development 
research should examine the degree to which 
scores include a speed component and evaluate 
the appropriateness of that component, given 
the domain the test is designed to measure. 

Standard 3.19 

The directions for test administration should 
be presented with sufficient clarity and emphasis
so that it is possible for others to replicate
adequately the administration conditions under 
which the data on reliability and validity, and, 
where appropriate, norms were obtained. 

Comment: Because all people administering 
tests, including those in schools, industry, and 
clinics, need to follow test administration con- 
ditions carefully, it is essential that test admin- 
istrators receive detailed instructions on test 
administration guidelines and procedures. 

Standard 3.20 

The instructions presented to test takers should 
contain sufficient detail so that test takers can 
respond to a task in the manner that the test 
developer intended. When appropriate, sample 
material, practice or sample questions, criteria 
for scoring, and a representative item identi- 
fied with each major area in the test’s classifi- 
cation or domain should be provided to the 
test takers prior to the administration of the 
test or included in the testing material as part 
of the standard administration instructions. 

Comment: For example, in a personality 
inventory it may be intended that test takers 
give the first response that occurs to them. 
Such an expectation should be made clear in 
the inventory directions. As another example, 
in directions for interest or occupational 
inventories, it may be important to specify 
whether test takers are to mark the activities 
they would like ideally or whether they are 
to consider both their opportunity and their 
ability realistically. 

The extent and nature of practice materi- 
als and directions depend on expected levels 
of knowledge among test takers. For example, 
in using a novel test format, it may be very 
important to provide the test taker a practice 
opportunity as part of the test administration. 
In some testing situations, it may be important 
for the instructions to address such matters as 
the effects that guessing and time limits have 
on test scores. If expansion or elaboration of 
the test instructions is permitted, the conditions
under which this may be done should be
stated clearly in the form of general rules and 
by giving representative examples. If no expan- 
sion or elaboration is to be permitted, this 
should be stated explicitly. Publishers should 
include guidance for dealing with typical 
questions from test takers. Users should be 
instructed how to deal with questions that 
may arise during the testing period. 

Standard 3.21 

If the test developer indicates that the condi- 
tions of administration are permitted to vary 
from one test taker or group to another, per- 
missible variation in conditions for adminis- 
tration should be identified, and a rationale 
for permitting the different conditions should 
be documented. 

Comment: In deciding whether the conditions 
of administration can vary, the test developer 
needs to consider and study the potential 
effects of varying conditions of administra- 
tion. If conditions of administration vary 
from the conditions studied by the test devel- 
oper or from those used in the development 
of norms, the comparability of the test scores 
may be weakened and the applicability of the 
norms can be questioned. 

Standard 3.22 

Procedures for scoring and, if relevant, 
scoring criteria should be presented by 
the test developer in sufficient detail and 
clarity to maximize the accuracy of scoring. 
Instructions for using rating scales or for 
deriving scores obtained by coding, scaling, 
or classifying constructed responses should 
be clear. This is especially critical if tests 
can be scored locally. 

Standard 3.23 

The process for selecting, training, and qualify- 
ing scorers should be documented by the test 
developer. The training materials, such as the
scoring rubrics and examples of test takers’
responses that illustrate the levels on the score 
scale, and the procedures for training scorers 
should result in a degree of agreement among 
scorers that allows for the scores to be interpret- 
ed as originally intended by the test developer. 
Scorer reliability and potential drift over time 
in raters’ scoring standards should be evaluat- 
ed and reported by the person(s) responsible 
for conducting the training session. 

Standard 3.24 

When scoring is done locally and requires 
scorer judgment, the test user is responsible 
for providing adequate training and instruc- 
tion to the scorers and for examining scorer 
agreement and accuracy. The test developer 
should document the expected level of scorer 
agreement and accuracy. 

Comment: A common practice of test devel- 
opers is to provide examples of training mate- 
rials (e.g., scoring rubrics, test takers’ responses 
at each score level) and procedures when scoring 
is done locally and requires scorer judgment. 
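
One way scorer agreement might be examined is sketched
below, using the proportion of exact agreement and
Cohen's kappa (agreement corrected for chance) for two
scorers who rated the same responses. These two indices
are chosen here only as common examples; the Standards
do not name a specific statistic, and the scores shown
are hypothetical.

```python
from collections import Counter


def exact_agreement(a, b):
    """Proportion of responses given the same score by both scorers."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_obs = exact_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from the two scorers' marginal score distributions.
    p_exp = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_exp == 1:
        return float("nan")  # undefined when both scorers use a single category
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical rubric scores (0 to 3) from two scorers on eight responses.
scorer_1 = [0, 1, 2, 2, 3, 1, 0, 2]
scorer_2 = [0, 1, 2, 3, 3, 1, 1, 2]

agreement = exact_agreement(scorer_1, scorer_2)
kappa = cohen_kappa(scorer_1, scorer_2)
print(f"exact agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

The test user could compare such values with the
expected level of scorer agreement documented by the
test developer.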

Standard 3.25 

A test should be amended or revised when 
new research data, significant changes in the 
domain represented, or newly recommended 
conditions of test use may lower the validity 
of test score interpretations. Although a test 
that remains useful need not be withdrawn 
or revised simply because of the passage of 
time, test developers and test publishers are 
responsible for monitoring changing condi- 
tions and for amending, revising, or with- 
drawing the test as indicated. 

Comment: Test developers need to consider a 
number of factors that may warrant the revi- 
sion of a test, including outdated test content 
and language. If an older version of a test is 
used when a newer version has been published 
or made available, test users are responsible for
providing evidence that the older version is
as appropriate as the new version for that 
particular test use. 

Standard 3.26 

Tests should be labeled or advertised as 
“revised” only when they have been revised 
in significant ways. A phrase such as “with 
minor modification” should be used when 
the test has been modified in minor ways. 
The score scale should be adjusted to account 
for these modifications, and users should be 
informed of the adjustments made to the 
score scale. 

Comment: It is the test developer’s responsi- 
bility to determine whether revisions to a test 
would influence test score interpretations. If 
test score interpretations would be affected 
by the revisions, it would then be appropriate 
to label the test “revised.” When tests are 
revised, the nature of the revisions and their 
implications on test score interpretations 
should be documented. 

Standard 3.27 

If a test or part of a test is intended for 
research use only and is not distributed for 
operational use, statements to this effect 
should be displayed prominently on all rele- 
vant test administration and interpretation 
materials that are provided to the test user. 

Comment: This standard refers to tests that 
are intended for research use only and does 
not refer to standard test development func- 
tions that occur prior to the operational use 
of a test (e.g., field testing).

Background

Test scores are reported on scales designed to 
assist score interpretation. Typically, scoring 
begins with responses to separate test items, 
which are often coded using 0 or 1 to represent 
wrong/right or negative/positive, but sometimes 
using numerical values to indicate finer response 
gradations. Then the item scores are combined, 
often by addition but sometimes by a more 
elaborate procedure, to obtain a raw score. Raw 
scores are determined, in part, by features of a 
test such as test length, choice of time limit, 
item difficulties, and the circumstances under 
which the test is administered. This makes raw 
scores difficult to interpret in the absence of 
further information. Interpretation and statisti- 
cal analyses may be facilitated by converting 
raw scores into an entirely different set of val- 
ues called derived scores or scale scores. The vari- 
ous scales used for reporting scores on college 
admissions tests, the standard scores often 
used to report results for intelligence scales or 
vocational interest and personality inventories, 
and the grade equivalents reported for achieve- 
ment tests in the elementary grades are exam- 
ples of scale scores. The process of developing 
such a score scale is called scaling a test. Scale 
scores may aid interpretation by indicating 
how a given score compares to those of other 
test takers, by enhancing the comparability of 
scores obtained using different forms of a test, 
or in other ways. 
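
A minimal sketch of one simple kind of scaling follows:
a linear conversion of raw scores to standard scores
with a chosen mean and standard deviation. The target
mean of 100 and standard deviation of 15 are assumed
values for illustration, and the sketch standardizes
against the same group of scores it converts, whereas
an operational scale would normally be anchored to a
defined reference sample.

```python
from statistics import mean, pstdev


def to_scale_scores(raw_scores, scale_mean=100.0, scale_sd=15.0):
    """Linearly convert raw scores to a scale with the given mean and SD."""
    m = mean(raw_scores)
    s = pstdev(raw_scores)
    if s == 0:
        raise ValueError("raw scores have no variation; scale is undefined")
    return [scale_mean + scale_sd * (x - m) / s for x in raw_scores]


# Hypothetical raw scores for five test takers.
raw = [12, 15, 18, 21, 24]
scaled = to_scale_scores(raw)
for r, s in zip(raw, scaled):
    print(f"raw {r:2d} -> scale {s:6.1f}")
```

A scale score from such a conversion indicates how far
a score lies above or below the group mean, which a raw
score alone does not show.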

Another way of assisting score interpreta- 
tion is to establish standards or cut scores that 
distinguish different score ranges. In some 
cases, a single cut score may define the bound- 
ary between passing and failing. In other cases, 
a series of cut scores may define distinct pro- 
ficiency levels. Cut scores may be established 
for either raw or scale scores. Both scale scores 
and standards or cut scores can be central to 
the use and interpretation of test scores. For 
that reason, their defensibility is an important 
consideration in test validation. There is a close 
connection between standards or cut scores 
and certain scale scores. If the successive score 
ranges defined by a series of cut scores are 
relabeled, say 0, 1, 2, and so on, then a scale 
score has been created. 
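
The relabeling just described can be sketched in a few lines of Python. The cut score values are hypothetical, and placing a score that equals a cut score in the higher range is an assumption of this sketch rather than a rule stated here.

```python
from bisect import bisect_right

def proficiency_level(score, cut_scores):
    """Return 0, 1, 2, ... for the score range in which a score falls.

    cut_scores must be sorted in ascending order. A score equal to a
    cut score is placed in the higher range (an assumption).
    """
    return bisect_right(cut_scores, score)

cuts = [20, 35, 50]  # hypothetical cut scores
levels = [proficiency_level(s, cuts) for s in (12, 20, 34, 50, 61)]
```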

In addition to facilitating interpretations 
of a single test form considered in isolation, 
scale scores are often created to enhance com- 
parability across different forms of the same 
test, across test formats or administration 
conditions, or even across tests designed to 
measure different constructs (e.g., related sub- 
tests in a battery). Equated scores from alter- 
nate forms of a test can often be interpreted 
more easily when expressed in scale score units 
rather than raw score units. Scaling may be 
used to place scores from different levels of an 
achievement test on a continuous scale and 
thereby facilitate inferences about growth or 
development. Scaling can also enhance the 
comparability of scores derived from tests in 
different areas, as in subtests within an apti- 
tude, interest, or achievement battery. 

Norm-Referenced and Criterion- 
Referenced Score Interpretations 

Individual raw scores or scale scores are often 
referred to the distribution of scores for one 
or more comparison groups to draw useful 
inferences about an individual’s performance. 
Test score interpretations based on such compar- 
isons are said to be norm-referenced. Percentile 
rank norms, for example, indicate the stand- 
ing of an individual or group within a defined 
population of individuals or groups. An example 
of such a comparison group might be fourth- 
grade students in the United States, tested in 
the last 2 months of a recent school year. 
Percentiles, averages, or other statistics for such 
reference groups are called norms. By showing 
how the test score of a given examinee compares 
to those of others, norms assist in the 
classification or description of examinees. 
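
One common way of computing a percentile rank against a reference group may be sketched as follows. Counting half of the tied scores is a convention assumed for this illustration, and the reference scores are invented.

```python
def percentile_rank(score, reference_scores):
    """Percent of the reference group scoring below `score`,
    counting half of those whose scores tie with it."""
    n = len(reference_scores)
    if n == 0:
        raise ValueError("reference group is empty")
    below = sum(1 for x in reference_scores if x < score)
    ties = sum(1 for x in reference_scores if x == score)
    return 100.0 * (below + 0.5 * ties) / n

norm_group = [10, 12, 12, 15, 18, 20, 22, 25]  # hypothetical norm sample
pr = percentile_rank(18, norm_group)
```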

Other test score interpretations make no 
direct reference to the performance of other 
examinees. These interpretations may take a 
variety of forms; most are collectively referred 
to as criterion-referenced interpretations. Derived 
scores supporting such interpretations may 
indicate the likely proportion of correct 
responses on some larger domain of items, or 
the probability of an examinee’s answering 
particular sorts of items correctly. Other crite- 
rion-referenced interpretations may indicate 
the likelihood that some psychopathology is 
present. Still other criterion-referenced inter- 
pretations indicate the probability that an 
examinee’s level of tested knowledge or skill 
is adequate to perform successfully in some 
other setting; such probabilities may be sum- 
marized in an expectancy table. Scale scores 
to support such criterion-referenced score 
interpretations are often developed on the 
basis of statistical analyses of the relationships 
of test scores to other variables. 
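
An expectancy table of the kind mentioned above might be tabulated as in the sketch below. The score bands, the success criterion, and the records are all hypothetical; an operational table would rest on a much larger sample.

```python
from collections import defaultdict

def expectancy_table(records, band_width):
    """records: iterable of (test_score, succeeded) pairs.
    Returns {band_lower_bound: proportion who succeeded}."""
    counts = defaultdict(lambda: [0, 0])  # band -> [successes, total]
    for score, succeeded in records:
        band = (score // band_width) * band_width
        counts[band][0] += 1 if succeeded else 0
        counts[band][1] += 1
    return {band: s / n for band, (s, n) in sorted(counts.items())}

# Hypothetical (test score, later success in the other setting) records.
data = [(12, False), (15, False), (18, True),
        (22, True), (25, False), (27, True),
        (31, True), (34, True)]
table = expectancy_table(data, band_width=10)
```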

Some scale scores are developed primarily 
to support norm-referenced interpretations 
and others, criterion-referenced interpretations. 
In practice, however, there is not always a sharp 
distinction. Both criterion-referenced and 
norm-referenced scales may be developed and 
used for the same test scores. Moreover, a 
norm-referenced score scale originally devel- 
oped, for example, to indicate performance 
relative to some specific reference population 
might, over time, also come to support crite- 
rion-referenced interpretations. This could 
happen as research and experience brought 
increased understanding of the capabilities 
implied by different scale score levels. 
Conversely, results of an educational assess- 
ment might be reported on a scale consisting 
of several ordered proficiency levels, defined 
by descriptions of the kinds of tasks students 
at each level were able to perform. That would 
be a criterion-referenced scale, but once the 
distribution of scores over levels was reported, 
say, for all eighth-grade students in a given 
state, individual students’ scores would also 
convey information about their standing rela- 
tive to that tested population. 

Interpretations based on cut scores may 
likewise be either criterion-referenced or 
norm-referenced. If qualitatively different 
descriptions are attached to successive score 
ranges, a criterion-referenced interpretation is 
supported. For example, the descriptions of 
performance levels in some assessment task 
scoring rubrics can enhance score interpreta- 
tion by summarizing the capabilities that must 
be demonstrated to merit a given score. In 
other cases, criterion-referenced interpretations 
may be based on empirically determined rela- 
tionships between test scores and other vari- 
ables. But when tests are used for selection, it 
may be appropriate to rank-order examinees 
according to their test performance and estab- 
lish a cut score so as to select a prespecified 
number or proportion of examinees from one 
end of the distribution, if the selection use is 
otherwise supported by relevant reliability 
and validity evidence. In such cases, the cut 
score interpretation is norm-referenced; the 
labels reject or fail versus accept or pass are 
determined solely by an examinee’s standing 
relative to others tested. 

Criterion-referenced interpretations based 
on cut scores are sometimes criticized on the 
grounds that there is very rarely a sharp dis- 
tinction of any kind between those just below 
versus just above a cut score. A neuropsy- 
chological test may be helpful in diagnosing 
some particular impairment, for example, but 
the probability that the impairment is pres- 
ent is likely to increase continuously as a 
function of the test score. Cut scores may 
nonetheless aid in formulating rules for 
reaching decisions on the basis of test per- 
formance. It should be recognized, however, 
that the probability of misclassification will 
generally be relatively high for persons with 
scores close to the cut points. 
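
The last point can be illustrated numerically. The sketch below assumes that an observed score equals a true score plus normally distributed error with a known standard error of measurement; that model, the function names, and the numbers are assumptions made for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def misclassification_probability(true_score, cut_score, sem):
    """Probability that the observed score falls on the opposite side
    of the cut score from the true score, assuming
    observed = true + Normal(0, sem**2) error."""
    p_observed_below = normal_cdf((cut_score - true_score) / sem)
    if true_score >= cut_score:
        return p_observed_below
    return 1.0 - p_observed_below

near = misclassification_probability(71, cut_score=70, sem=3)
far = misclassification_probability(80, cut_score=70, sem=3)
```

Under these assumptions the probability is about one in three for a true score one point above the cut, and well under one in a hundred for a true score ten points above it.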

Norms 

The validity of norm-referenced interpretations 
depends in part on the appropriateness of the 
reference group to which test scores are com- 
pared. Norms based on hospitalized patients, 
for example, might be inappropriate for some 
interpretations of nonhospitalized patients’ 
scores. Thus, it is important that reference 
populations be carefully defined and clearly 
described. Validity of such interpretations also 
depends on the accuracy with which norms 
summarize the performance of the reference 
population. That population may be small 
enough that essentially the entire population 
can be tested (e.g., all pupils at a given grade 
level in a given district tested on the same 
occasion). Often, however, only a sample of 
examinees from the reference population is 
tested. It is then important that the norms be 
based on a technically sound, representative, 
scientific sample of sufficient size. Patients in 
a few hospitals in a small geographic region 
are unlikely to be representative of all patients 
in the United States, for example. Moreover, 
the appropriateness of norms based on a given 
sample may diminish over time. Thus, for tests 
that have been in use for a number of years, 
periodic review is generally required to assure 
the continued utility of norms. Renorming may 
be required to maintain the validity of norm- 
referenced test score interpretations. 

More than one reference population may 
be appropriate for the same test. For example, 
achievement test performance might be inter- 
preted by reference to local norms based on 
sampling from a particular school district, 
norms for a state or type of community, or 
national norms. For other tests, norms might 
be based on occupational or educational clas- 
sifications. Descriptive statistics for all exam- 
inees who happen to be tested during a given 
period of time (sometimes called user norms 
or program norms) may be useful for some 
purposes, such as describing trends over time. 
But there must be sound reason to regard that 
group of test takers as an appropriate basis for 
such inferences. When there is a suitable ration- 
ale for using such a group, the descriptive sta- 
tistics should be clearly characterized as being 
based on a sample of persons routinely tested 
as part of an ongoing program. 

Comparability and Equating 

Many test uses involve different versions of 
the same test, which yield scores that can be 
used interchangeably even though they are 
based on different sets of items. In testing 
programs that offer a choice of examination 
dates, for example, test security may be com- 
promised if the same form is used repeatedly. 
Other testing applications may entail repeated 
measurements of the same individuals, perhaps 
to measure change in levels of psychological 
dysfunction, change in attitudes, or educa- 
tional progress. In such contexts, reuse of the 
same set of test items may result in correlated 
errors of measurement and biased estimates 
of change. When distinct forms of a test are 
constructed to the same explicit content and 
statistical specifications and administered 
under identical conditions, they are referred 
to as alternate forms or sometimes parallel or 
equivalent forms. The process of placing scores 
from such alternate forms on a common scale 
is called equating. Equating is analogous to 
the calibration of different balances so that 
they all indicate the same weight for any given 
object. However, the equating process for test 
scores is more complex. It involves small statis- 
tical adjustments to account for minor differ- 
ences in the difficulty and statistical properties 
of the alternate forms. 
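
As one simple illustration of such a statistical adjustment, the sketch below applies linear equating, which matches the means and standard deviations of two forms. It assumes that both forms were given to randomly equivalent groups; it is only one of several equating methods, and the data are hypothetical.

```python
from statistics import mean, pstdev

def linear_equating(form_x_scores, form_y_scores):
    """Return a function that converts a Form X raw score to the
    Form Y scale by matching means and standard deviations."""
    mx, sx = mean(form_x_scores), pstdev(form_x_scores)
    my, sy = mean(form_y_scores), pstdev(form_y_scores)
    if sx == 0:
        raise ValueError("Form X scores have no spread")

    def convert(x):
        return my + (sy / sx) * (x - mx)

    return convert

# Hypothetical raw scores from two randomly equivalent groups.
form_x = [20, 24, 28, 32, 36]
form_y = [23, 26, 29, 32, 35]
to_y_scale = linear_equating(form_x, form_y)
equated = to_y_scale(36)
```

Here Form X appears slightly harder on average and more spread out, so a Form X score of 36 converts to about 35 on the Form Y scale.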

In theory, equating should provide accu- 
rate score conversions for any set of persons 
drawn from the examinee population for which 
the test is designed. Furthermore, the same 
score conversion should be appropriate regard- 
less of the score interpretation or use intend- 
ed. It is not possible to construct conversions 
with these ideal properties between scores on 
tests that measure different constructs; that 
differ materially in difficulty, reliability, time 
limits, or other conditions of administration; 
or that are designed to different specifications. 

There is another assessment approach 
that may provide interchangeable scores based 
on responses to different items using different 
methods, not referred to as equating. This is 
the use of adaptive tests. It has long been 
recognized that little is learned from examinees’ 
responses to items that are much too easy or 
much too difficult for them. Consequently, 
some testing procedures use only a subset of 
the available items with each examinee in 
order to avoid boredom or frustration, or to 
shorten testing time. An adaptive test con- 
sists of a pool of items together with rules 
for selecting a subset of those items to be 
administered to an individual examinee, and 
a procedure for placing different examinees’ 
scores on a common scale. The selection 
of successive items is based in part on the 
examinee’s responses to previous items. The 
item pool and item selection rules may be 
designed so that each examinee receives a 
representative set of items, of appropriate 
difficulty. The selection rules generally 
assure that an acceptable degree of precision 
is attained before testing is terminated. At 
one time, such tailored testing was limited 
to certain individually administered psy- 
chological tests. With advances in item 
response theory (IRT) and in computer 
technology, however, adaptive testing is 
becoming more sophisticated. With some 
adaptive tests, it may happen that two 
examinees rarely if ever respond to precisely 
the same set of items. Moreover, two 
examinees taking the same adaptive test may be 
given sets of items that differ markedly in 
difficulty. Nevertheless, when certain statis- 
tical and content conditions are met, test 
scores produced by an adaptive testing sys- 
tem can function like scores from equated 
alternate forms. 
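
The general logic of an adaptive test may be sketched as below. This is a toy illustration under strong assumptions: a Rasch (one-parameter) item response model, a crude step-size update in place of a proper IRT ability estimator, an approximate standard error as the stopping rule, and no content constraints. All names and values are hypothetical.

```python
import math

def p_correct(theta, b):
    """Rasch (one-parameter logistic) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def run_adaptive_test(difficulties, answer_fn, se_target=0.5, max_items=20):
    """Administer items one at a time, choosing each in light of the
    examinee's earlier responses.
    Returns (ability_estimate, administered_item_ids)."""
    theta = 0.0   # provisional ability estimate
    step = 1.0    # size of the next adjustment
    info = 0.0    # accumulated approximate test information
    remaining = dict(enumerate(difficulties))
    administered = []
    while remaining and len(administered) < max_items:
        # Under the Rasch model an item is most informative when its
        # difficulty is closest to the current ability estimate.
        item = min(remaining, key=lambda i: abs(remaining[i] - theta))
        b = remaining.pop(item)
        p = p_correct(theta, b)
        info += p * (1.0 - p)
        correct = answer_fn(item, b)
        # Crude update standing in for a proper IRT estimator.
        theta += step if correct else -step
        step = max(0.7 * step, 0.1)
        administered.append(item)
        # Stop once the approximate standard error is small enough.
        if 1.0 / math.sqrt(info) <= se_target:
            break
    return theta, administered

pool = [-2.5 + 0.25 * i for i in range(21)]  # hypothetical difficulties
high, high_items = run_adaptive_test(pool, lambda i, b: True, max_items=8)
low, low_items = run_adaptive_test(pool, lambda i, b: False, max_items=8)
```

In this example an examinee who answers every item correctly and one who answers every item incorrectly share only the first item, which shows how item sets can differ markedly in difficulty across examinees.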


Scaling to Achieve Comparability 

The term equating is properly reserved only 
for score conversions derived for alternate forms 
of the same test. It is often useful, however, to 
compare scores from tests that cannot, in the- 
ory, be equated. For example, it may be desir- 
able to interpret scores from a shortened (and 
hence less reliable) form of a test by first con- 
verting them to corresponding scores on the 
full-length version. For the evaluation of exam- 
inee growth over time, it may be desirable to 
develop scales that span a broad range of devel- 
opmental or educational levels. Test revision 
often brings a need for some linkage between 
scores obtained using newer and older editions. 
International comparative studies or use with 
hearing-impaired examinees may require test 
forms in different languages. In still other 
cases, linkages or alignments may be created 
between tests measuring different constructs, 
perhaps comparing an aptitude with a form 
of behavior, or linking measures of achieve- 
ment in several content areas. Scores from 
such tests may sometimes be aligned or pre- 
sented in a concordance table to aid users in 
estimating relative performance on one test 
from performance on another. 

Score conversions to facilitate such com- 
parisons may be described using terms like 
linkage, calibration, concordance, projection, 
moderation, or anchoring. These weaker score 
linkages may be technically sound and may 
fully satisfy desired goals of comparability for 
one purpose or for one subgroup of examinees, 
but they cannot be assumed to be stable over 
time or invariant across multiple subgroups of 
the examinee population, nor is there any 
assurance that scores obtained using different tests 
will be equally accurate. Thus, their use for other 
purposes or with other populations than origi- 
nally intended may require additional research. 
For example, a score conversion that was accu- 
rate for a group of native speakers might sys- 
tematically overpredict or underpredict the 
scores of a group of nonnative speakers. 

Cut Scores 

A critical step in the development and use of 
some tests is to establish one or more cut points 
dividing the score range to partition the dis- 
tribution of scores into categories. These cate- 
gories may be used just for descriptive purposes 
or may be used to distinguish among exam- 
inees for whom different programs are deemed 
desirable or different predictions are warrant- 
ed. An employer may determine a cut score 
to screen potential employees or promote cur- 
rent employees; a school may use test scores 
to decide which of several alternative instruc- 
tional programs would be most beneficial for 
a student; in granting a professional license, a 
state may specify a minimum passing score 
on a licensure test. 

These examples differ in important 
respects, but all involve delineating categories 
of examinees on the basis of test scores. Such 
cut scores embody the rules according to which 
tests are used or interpreted. Thus, in some 
situations the validity of test interpretations 
may hinge on the cut scores. There can be no 
single method for determining cut scores for 
all tests or for all purposes, nor can there be 
any single set of procedures for establishing 
their defensibility. These examples serve only 
as illustrations. 

The first example, that of an employer 
hiring all those who earn scores above a given 
level on an employment test, is most straight- 
forward. Assuming that the employment test 
is valid for its intended use, average job per- 
formance would typically be expected to rise 
steadily, albeit slowly, with each increment in 
test score, at least for some range of scores 
surrounding the cut point. In such a case the 
designation of the particular value for the cut 
point may be largely determined by the num- 
ber of persons to be hired or promoted. There 
is no sharp difference between those just below 
the cut point and those just above it, and the 
use of the cut score does not entail any 
criterion-referenced interpretation. This method 
of establishing a cut score may be subject to 
legal requirements with respect to the nature 
of the validity and reliability evidence needed 
to support the use of rank-order selections 
and the unavailability of effective alternative 
selection methods, if it has a disproportionate 
effect on one or more subgroups of employees 
or prospective employees. 

In the second example, a school district 
might structure its courses in writing around 
three categories of needs. For children whose 
proficiency is least developed, instruction 
might be provided in small groups, with con- 
siderable individual attention to assist them 
in creating meaningful written stories grounded 
in their own experience. For children whose 
proficiency was further developed, more empha- 
sis might be placed on systematic exploration 
of the stages of the writing process. Instruction 
for children at the highest proficiency level might 
emphasize mastery of specific writing genres 
or prose structures used in more formal writ- 
ing. In an appropriate implementation of such 
a program, children could easily be transferred 
from one level to another if their original 
placement was in error or as their proficiency 
increased. Ideally, cut scores delineating cate- 
gories in this application would be based on 
research demonstrating empirically that pupils 
in successive score ranges did most often ben- 
efit more from the respective treatments to 
which they were assigned than from the 
alternatives available. It would typically be found 
that between those score ranges in which one 
or another instructional treatment was clearly 
superior, there was an intermediate region in 
which neither treatment was clearly preferred. 
The cut score might be located somewhere in 
that intermediate region. 

In the final example, that of a professional 
licensure examination, the cut score represents 
an informed judgment that those scoring below 
it are likely to make serious errors for want of 
the knowledge or skills tested. Little evidence 
apart from errors made on the test itself may 
document the need to deny the right to practice 
the profession. No test is perfect, of 
course, and regardless of the cut score chosen, 
some examinees with inadequate skills are 
likely to pass and some with adequate skills 
are likely to fail. The relative probabilities of 
such false positive and false negative errors 
will vary depending on the cut score chosen. 
A given probability of exposing the public 
to potential harm by issuing a license to an 
incompetent individual (false positive) must 
be weighed against some corresponding 
probability of denying a license to, and there- 
by disenfranchising, a qualified examinee 
(false negative). Changing the cut score to 
reduce either probability will increase the 
other, although both kinds of errors can be 
minimized through sound test design that 
anticipates the role of the cut score in test use 
and interpretation. Determining cut scores 
in such situations cannot be a purely tech- 
nical matter, although empirical studies 
and statistical models can be of great value 
in informing the process. 
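
The trade-off between the two kinds of error can be shown with a small Python sketch. It assumes that each examinee's true qualification status is known, which is rarely the case in practice, and the scores are invented for illustration.

```python
def error_rates(records, cut_score):
    """records: (test_score, truly_qualified) pairs.
    Returns (false_positive_rate, false_negative_rate), where a false
    positive is an unqualified examinee who passes and a false negative
    is a qualified examinee who fails."""
    unqualified = [s for s, q in records if not q]
    qualified = [s for s, q in records if q]
    fp = sum(1 for s in unqualified if s >= cut_score) / len(unqualified)
    fn = sum(1 for s in qualified if s < cut_score) / len(qualified)
    return fp, fn

# Hypothetical examinees: five unqualified, five qualified.
records = [(40, False), (45, False), (50, False), (55, False), (60, False),
           (50, True), (58, True), (62, True), (66, True), (70, True)]
lower_cut = error_rates(records, cut_score=52)
higher_cut = error_rates(records, cut_score=59)
```

Raising the hypothetical cut score from 52 to 59 halves the false positive rate and doubles the false negative rate in these data.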

Cut scores embody value judgments as 
well as technical and empirical considerations. 
Where the results of the standard-setting process 
have highly significant consequences, and 
especially where large numbers of examinees 
are involved, those responsible for establish- 
ing cut scores should be concerned that the 
process by which cut scores are determined be 
clearly documented and defensible. The qual- 
ifications of any judges involved in standard 
setting and the process by which they are 
selected are part of that documentation. Care 
must be taken to assure that judges under- 
stand what they are to do. The process must 
be such that well-qualified judges can apply 
their knowledge and experience to reach 
meaningful and relevant judgments that accu- 
rately reflect their understandings and inten- 
tions. A sufficiently large and representative 
group of judges should be involved to provide 
reasonable assurance that results would not 
vary greatly if the process were replicated. 


Standard 4.1 

Test documents should provide test users 
with clear explanations of the meaning and 
intended interpretation of derived score scales, 
as well as their limitations. 

Comment: All scales (raw score or derived) may 
be subject to misinterpretation. Sometimes 
scales are extrapolated beyond the range of 
available data or are interpolated without suffi- 
cient data points. Grade- and age-equivalent 
scores have been criticized in this regard, but 
percentile ranks and standard score scales are 
also subject to misinterpretation. If the nature 
or intended uses of a scale are novel, it is espe- 
cially important that its uses, interpretations, 
and limitations be clearly described. Illustrations 
of appropriate versus inappropriate interpreta- 
tions may be helpful, especially for types of 
scales or interpretations that may be unfamiliar 
to most users. This standard pertains to score 
scales intended for criterion-referenced as well 
as for norm-referenced interpretation. 

Standard 4.2 

The construction of scales used for report- 
ing scores should be described clearly in 
test documents. 

Comment: When scales, norms, or other 
interpretive systems are provided by the test 
developer, technical documentation should 
enable users to judge the quality and preci- 
sion of the resulting derived scores. This 
standard pertains to score scales intended for 
criterion-referenced as well as for norm-refer- 
enced interpretation. 

Standard 4.3 

If there is sound reason to believe that spe- 
cific misinterpretations of a score scale are 
likely, test users should be explicitly 
forewarned. 

Comment: Test publishers and users can reduce 
misinterpretations of grade-equivalent scores, 
for example, by ensuring that such scores are 
accompanied by instructions that make clear 
that grade-equivalent scores do not represent a 
standard of growth per year or grade and that 
roughly 50% of the students tested in the stan- 
dardization sample should by definition fall 
below grade level. As another example, a score 
scale point originally defined as the mean of 
some reference population should no longer be 
interpreted as representing average perform- 
ance if the scale is held constant over time and 
the examinee population changes. 

Standard 4.4 

When raw scores are intended to be directly 
interpretable, their meanings, intended 
interpretations, and limitations should be 
described and justified in the same manner 
as is done for derived score scales. 

Comment: In some cases the items in a test 
are a representative sample of a well-defined 
domain of items. The proportion correct on 
the test may then be interpreted as an estimate 
of the proportion of items in the domain that 
could be answered correctly. In other cases, 
different interpretations may be attached to 
scores above or below one or another cut score. 
Support should be offered for any such inter- 
pretations recommended by the test developer. 
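
For the first case, a domain estimate and a rough indication of its precision might be computed as follows. The sketch assumes that the items are a simple random sample from a very large domain, so that the binomial standard error applies; the figures are hypothetical.

```python
import math

def domain_proportion_estimate(num_correct, num_items):
    """Estimate the proportion of domain items an examinee could answer
    correctly, with a binomial standard error for that estimate."""
    if num_items <= 0:
        raise ValueError("num_items must be positive")
    p = num_correct / num_items
    se = math.sqrt(p * (1.0 - p) / num_items)
    return p, se

p_hat, se = domain_proportion_estimate(num_correct=32, num_items=40)
```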

Standard 4.5 

Norms, if used, should refer to clearly 
described populations. These populations 
should include individuals or groups to 
whom test users will ordinarily wish to 
compare their own examinees. 

Comment: It is the responsibility of test develop- 
ers to describe norms clearly and the responsibil- 
ity of test users to employ norms appropriately. 
Users need to know the applicability of a test to 
different groups. Differentiated norms or 
summary information about differences between 
gender, ethnic, language, disability, grade, or 
age groups, for example, may be useful in some 
cases. The permissible uses of such differenti- 
ated norms and related information may be 
limited by law. Users also need to be made alert 
to situations in which norms are less appropri- 
ate for some groups or individuals than others. 
On an occupational interest inventory, for 
example, norms for persons actually engaged 
in an occupation may be inappropriate for 
interpreting the scores of persons not so 
engaged. As another example, the appropri- 
ateness of norms for personality inventories 
or relationship scales may differ depending 
upon an examinee’s sexual orientation. 

Standard 4.6 

Reports of norming studies should include 
precise specification of the population that 
was sampled, sampling procedures and par- 
ticipation rates, any weighting of the sample, 
the dates of testing, and descriptive statistics. 
The information provided should be sufficient 
to enable users to judge the appropriateness of 
the norms for interpreting the scores of local 
examinees. Technical documentation should 
indicate the precision of the norms themselves. 

Comment: Scientific sampling is important if 
norms are to be representative of intended 
populations. For example, schools already 
using a given published test and volunteering 
to participate in a norming study for that test 
should not be assumed to be representative of 
schools in general. In addition to sampling pro- 
cedures, participation rates should be reported, 
and the method of calculating participation 
rates should be clearly described. Studies that are 
designed to be nationally representative often 
use weights so that the weighted sample better 
represents the nation than does the unweighted 
sample. When weights are used, it is important 
that the procedure for deriving the weights be 
described and that the demographic representation 
of both the weighted and the unweighted 
samples be given. If norming data are collect- 
ed under conditions in which student motiva- 
tion in completing the test is likely to differ 
from that expected during operational use, this 
should be clearly documented. Likewise, if the 
instructional histories of students in the norm- 
ing sample differ systematically from those to 
be expected during operational test use, that 
fact should be noted. Norms based on samples 
cannot be perfectly precise. Even though the 
imprecision of norm-referenced interpretations 
due to imperfections in the norms themselves 
may be small compared to that due to meas- 
urement error, estimates of the precision of 
norms should be available in technical docu- 
mentation. For example, standard errors based 
on the sample design might be presented. In 
some testing applications, norms based on all 
examinees tested over a given period of time 
may be useful for some purposes. Such norms 
should be clearly characterized as being based 
on a sample of persons routinely tested as part 
of an ongoing testing program. 
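
The effect of weighting on a norming sample may be illustrated with the sketch below. It assumes post-stratification weights (population share divided by sample share), which is only one of several weighting approaches; the strata, counts, and means are hypothetical.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight for each stratum = population share / sample share."""
    n = sum(sample_counts.values())
    return {g: population_shares[g] / (sample_counts[g] / n)
            for g in sample_counts}

def weighted_mean(group_means, sample_counts, weights):
    """Mean score over strata, weighting each sampled examinee."""
    num = sum(weights[g] * sample_counts[g] * group_means[g]
              for g in group_means)
    den = sum(weights[g] * sample_counts[g] for g in group_means)
    return num / den

counts = {"urban": 60, "rural": 40}    # hypothetical sample sizes
shares = {"urban": 0.5, "rural": 0.5}  # hypothetical population shares
means = {"urban": 52.0, "rural": 47.0}
w = poststratification_weights(counts, shares)
unweighted = weighted_mean(means, counts, {g: 1.0 for g in counts})
weighted = weighted_mean(means, counts, w)
```

Reporting both results, together with the sample and population shares, lets users see how far the weights moved the norm.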

Standard 4.7 

If local examinee groups differ materially 
from the populations to which norms refer, a 
user who reports derived scores based on the 
published norms has the responsibility to 
describe such differences if they bear upon 
the interpretation of the reported scores. 

Comment: In employment settings, the qualifi- 
cations of local examinee groups may fluctuate 
depending on recruitment or referral proce- 
dures as well as market conditions. In such 
cases, appropriate test use and interpretation 
may not require documentation or cautions 
concerning departures from characteristics of 
the norming population. 

Standard 4.8 

When norms are used to characterize 
examinee groups, the statistics used to summarize 
each group’s performance and the norms to 
which those statistics are referred should be 
clearly defined and should support the 
intended use or interpretation. 

Comment: Group means are distributed dif- 
ferently from individual scores. For example, 
it is not possible to determine the percentile 
rank of a school’s average test score if all that is 
known are the percentile ranks of each of that 
school’s students. It may sometimes be useful to 
develop special norms for group means, but 
when the sizes of the groups differ materially 
or when some groups are much more heteroge- 
neous than others, the construction and inter- 
pretation of group norms is problematical. One 
common and acceptable procedure is to report 
the percentile rank of the median group 
member, for example, the median percentile 
rank of the pupils tested in a given school. 

Standard 4.9 

When raw score or derived score scales are 
designed for criterion-referenced interpreta- 
tion, including the classification of exam- 
inees into separate categories, the rationale 
for recommended score interpretations 
should be clearly explained. 

Comment: Criterion-referenced interpretations 
are score-based descriptions or inferences that 
do not take the form of comparisons to the test 
performance of other examinees. Examples 
include statements that some psychopathology 
is likely present, that a prospective employee 
possesses specific skills required in a given posi- 
tion, or that a child scoring above a certain score 
point can successfully apply a given set of skills. 
Such interpretations may refer to the absolute 
levels of test scores or to patterns of scores for 
an individual examinee. Whenever the test 
developer recommends such interpretations, 
the rationale and empirical basis should be 
clearly presented. Serious efforts should be 
made whenever possible to obtain independent 
evidence concerning the soundness of such 
score interpretations. Criterion-referenced 
and norm-referenced scales are not mutually 
exclusive. Given adequate supporting data, 
scores may be interpreted by both approaches, 
not necessarily just one or the other. 

Standard 4.10 

A clear rationale and supporting evidence 
should be provided for any claim that scores 
earned on different forms of a test may be 
used interchangeably. In some cases, direct 
evidence of score equivalence may be provid- 
ed. In other cases, evidence may come from 
a demonstration that the theoretical assump- 
tions underlying procedures for establishing 
score comparability have been sufficiently 
satisfied. The specific rationale and the evidence 
required will depend in part on the intended 
uses for which score equivalence is claimed. 

Comment: Support should be provided for any 
assertion that scores obtained using different 
items or testing materials, or different testing 
procedures, are interchangeable for some pur- 
pose. This standard applies, for example, to 
alternate forms of a paper-and-pencil test or 
to alternate sets of items taken by different 
examinees in computerized adaptive testing. 

It also applies to test forms administered in 
different formats (e.g., paper-and-pencil and 
computerized tests) or test forms designed for 
individual versus group administration. Score 
equivalence is easiest to establish when differ- 
ent forms are constructed following identical 
procedures and then equated statistically. When 
that is not possible, for example, in cases where 
different test formats are used, additional evi- 
dence may be required to establish the requisite 
degree of score equivalence for the intended 
context and purpose. When recommended 
inferences or actions are based solely on classifi- 
cations of examinees into one of two or more 
categories, the rationale and evidence should 
address consistency of classification. If the only 


score reported and used is a pass-fail decision, 
for example, then the form-to-form equiva- 
lence of measurements for examinees far above 
or far below the cut score is of no concern. 
Some testing accommodations may only affect 
the dependence of test scores on capabilities 
irrelevant to the construct the test is intended 
to measure. Use of a large-print edition, for 
example, assures that performance does not 
depend on the ability to perceive standard-size 
print. In such cases, relatively modest studies 
or professional judgment may be sufficient to 
support claims of score equivalence. 
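
As an illustration of the classification-consistency point in the comment above, the following minimal Python sketch computes how often two forms place the same examinees in the same pass-fail category, together with Cohen's kappa as a chance-corrected index. The function name, the paired scores, and the cut score of 50 are hypothetical and are not taken from the Standards; other consistency indices may be more appropriate in a given program.

```python
def classification_consistency(scores_a, scores_b, cut):
    """Return (agreement, kappa) for pass-fail decisions made on two forms.

    scores_a and scores_b are paired scores for the same examinees.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need paired, non-empty score lists")
    n = len(scores_a)
    pass_a = [s >= cut for s in scores_a]
    pass_b = [s >= cut for s in scores_b]
    # Proportion of examinees classified the same way on both forms.
    agreement = sum(a == b for a, b in zip(pass_a, pass_b)) / n
    # Agreement expected by chance from the two marginal pass rates.
    p_a, p_b = sum(pass_a) / n, sum(pass_b) / n
    chance = p_a * p_b + (1 - p_a) * (1 - p_b)
    kappa = (agreement - chance) / (1 - chance) if chance < 1 else 1.0
    return agreement, kappa


# Hypothetical paired scores for six examinees on Form A and Form B.
form_a = [48, 52, 61, 39, 70, 55]
form_b = [51, 50, 63, 41, 68, 49]
agreement, kappa = classification_consistency(form_a, form_b, cut=50)
```

Only examinees near the cut score change category in this example, which is consistent with the comment's observation that equivalence far from the cut score matters little for pass-fail reporting.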

Standard 4.11 

When claims of form-to-form score equiva- 
lence are based on equating procedures, 
detailed technical information should be 
provided on the method by which equating 
functions or other linkages were established 
and on the accuracy of equating functions. 

Comment: The fundamental concern is to 
show that equated scores measure essentially 
the same construct, with very similar levels of 
reliability and conditional standard errors of 
measurement. Technical information should 
include the design of equating studies, the 
statistical methods used, the size and relevant 
characteristics of examinee samples used in 
equating studies, and the characteristics of any 
anchor tests or linking items. Standard errors 
of equating functions should be estimated and 
reported whenever possible. Sample sizes per- 
mitting, it may be informative to determine 
equating functions independently for identifi- 
able subgroups of examinees. It may also be 
informative to use two anchor forms and to 
conduct the equating using each of the anchors. 
In some cases, equating functions may be deter- 
mined independently using different statistical 
methods. The correspondence of separate func- 
tions obtained by such methods can lend sup- 
port to the adequacy of the equating results. Any 
substantial disparities found by such methods 
should be resolved or reported. To be most 
useful, equating error should be presented in 
units of the reported score scale. For testing 
programs with cut scores, equating error near 
the cut score is of primary importance. The 
degree of scrutiny of equating functions should 
be commensurate with the extent of test use 
anticipated and the importance of the deci- 
sions the test scores are intended to inform. 
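
One way to make the reporting of equating error concrete is sketched below in Python. It assumes a random-groups design and linear equating (matching means and standard deviations), and it estimates the standard error at a single score point, such as a cut score, by bootstrap resampling. These choices, the function names, and the score data are illustrative assumptions only; the Standards do not prescribe a particular equating method.

```python
import random
import statistics


def linear_equate(x, scores_x, scores_y):
    """Map a Form X score onto the Form Y scale by matching means and SDs."""
    mu_x, mu_y = statistics.mean(scores_x), statistics.mean(scores_y)
    sd_x, sd_y = statistics.pstdev(scores_x), statistics.pstdev(scores_y)
    if sd_x == 0:
        raise ValueError("Form X scores have no variance")
    return mu_y + (sd_y / sd_x) * (x - mu_x)


def bootstrap_equating_se(x, scores_x, scores_y, reps=500, seed=0):
    """Bootstrap standard error of the equated score at x, in Form Y units."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        # Resample each group independently, with replacement.
        bx = rng.choices(scores_x, k=len(scores_x))
        by = rng.choices(scores_y, k=len(scores_y))
        if statistics.pstdev(bx) == 0:
            continue  # skip degenerate resamples with no variance
        estimates.append(linear_equate(x, bx, by))
    return statistics.stdev(estimates)


# Hypothetical scores from two randomly equivalent groups.
form_x = [10, 12, 14, 16, 18]
form_y = [12, 15, 18, 21, 24]
equated = linear_equate(16, form_x, form_y)  # approximately 21
se = bootstrap_equating_se(16, form_x, form_y, reps=200, seed=1)
```

Because the result is expressed on the Form Y scale, the standard error is already in units of the reported score, and evaluating it at the cut score addresses the region the comment identifies as most important.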

Standard 4.12 

In equating studies that rely on the statisti- 
cal equivalence of examinee groups receiving 
different forms, methods of assuring such 
equivalence should be described in detail. 

Comment: Certain equating designs rely on the 
random equivalence of groups receiving different 
forms. Often, one way to assure such equivalence 
is to systematically mix different test forms and 
then distribute them in a random fashion so 
that roughly equal numbers of examinees in 
each group tested receive each form. 
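
The distribution procedure described in this comment is often called spiraling. A minimal Python sketch follows, assuming a simple list of examinee identifiers and a list of form labels; the function name and the data are hypothetical.

```python
import random
from collections import Counter


def spiral_forms(examinee_ids, forms, seed=None):
    """Assign forms in a repeating order to examinees taken in random order."""
    rng = random.Random(seed)
    order = list(examinee_ids)
    rng.shuffle(order)  # randomize the order in which examinees are served
    # Cycle through the forms so that counts differ by at most one.
    return {eid: forms[i % len(forms)] for i, eid in enumerate(order)}


# Ten hypothetical examinees in one testing room, three forms.
assignment = spiral_forms(range(1, 11), ["A", "B", "C"], seed=42)
counts = Counter(assignment.values())
```

Running the assignment separately within each testing group keeps the form counts roughly equal in every group, not only overall.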

Standard 4.13 

In equating studies that employ an anchor 
test design, the characteristics of the anchor 
test and its similarity to the forms being 
equated should be presented, including both 
content specifications and empirically deter- 
mined relationships among test scores. If 
anchor items are used, as in some IRT-based 
and classical equating studies, the represen- 
tativeness and psychometric characteristics 
of anchor items should be presented. 

Comment: Tests or test forms may be linked 
via common items embedded within each of 
them, or a common test administered togeth- 
er with each of them. These common items 
or tests are referred to as linking items, anchor 
items, or anchor tests. With such methods, 
the quality of the resulting equating depends 
strongly on the adequacy of the anchor tests 
or items used. 


Standard 4.14 

When score conversions or comparison pro- 
cedures are used to relate scores on tests or 
test forms that are not closely parallel, the 
construction, intended interpretation, and 
limitations of those conversions or compar- 
isons should be clearly described. 

Comment: Various score conversions or con- 
cordance tables have been constructed relating 
tests at different levels of difficulty, relating 
earlier to revised forms of published tests, cre- 
ating score concordances between different 
tests of similar or different constructs, or for 
other purposes. Such conversions are often 
useful, but they may also be subject to misin- 
terpretation. The limitations of such conver- 
sions should be clearly described. 

Standard 4.15 

When additional test forms are created by tak- 
ing a subset of the items in an existing test form 
or by rearranging its items and there is sound 
reason to believe that scores on these forms 
may be influenced by item context effects, 
evidence should be provided that there is no 
undue distortion of norms for the different 
versions or of score linkages between them. 

Comment: Some tests and test batteries are 
published in both a full-length version and a 
survey or short version. In other cases, multi- 
ple versions of a single test form may be cre- 
ated by rearranging its items. It should not be 
assumed that performance data derived from 
the administration of items as part of the ini- 
tial version can be used to approximate norms 
or construct conversion tables for alternative 
intact tests. Due caution is required in cases 
where context effects are likely, including 
speeded tests, long tests where fatigue may be 
a factor, and so on. In many cases, adequate 
psychometric data may only be obtainable 
from independent administrations of the 
alternate forms. 

Standard 4.16 

If test specifications are changed from one 
version of a test to a subsequent version, such 
changes should be identified in the test man- 
ual, and an indication should be given that 
converted scores for the two versions may not 
be strictly equivalent. When substantial 
changes in test specifications occur, either 
scores should be reported on a new scale or 
a clear statement should be provided to alert 
users that the scores are not directly compara- 
ble with those on earlier versions of the test. 

Comment: Major shifts sometimes occur in the 
specifications of tests that are used for substan- 
tial periods of time. Often such changes take 
advantage of improvements in item types or 
of shifts in content that have been shown to 
improve validity and, therefore, are highly 
desirable. It is important to recognize, howev- 
er, that such shifts will result in scores that 
cannot be made strictly interchangeable with 
scores on an earlier form of the test. 

Standard 4.17 

Testing programs that attempt to maintain 
a common scale over time should conduct 
periodic checks of the stability of the scale 
on which scores are reported. 

Comment: In some testing programs, items are 
introduced into and retired from item pools on 
an ongoing basis. In other cases, the items in suc- 
cessive test forms may overlap very little, or not 
at all. In either case, if a fixed scale is used for re- 
porting, it is important to assure that the mean- 
ing of the scaled scores does not change over time. 

Standard 4.18 

If a publisher provides norms for use in test 
score interpretation, then so long as the test 
remains in print, it is the publisher's responsi- 
bility to assure that the test is renormed with 
sufficient frequency to permit continued accu- 
rate and appropriate score interpretations. 


Comment: Test publishers should assure that 
up-to-date norms are readily available, but it 
remains the test user’s responsibility to avoid 
inappropriate use of norms that are out of date 
and to strive to assure accurate and appropri- 
ate test interpretations. 

Standard 4.19 

When proposed score interpretations involve 
one or more cut scores, the rationale and 
procedures used for establishing cut scores 
should be clearly documented. 

Comment: Cut scores may be established to 
select a specified number of examinees (e.g., 
to fill existing vacancies), in which case little 
further documentation may be needed con- 
cerning the specific question of how the cut 
scores are established, though attention should 
be paid to legal requirements that may apply. 
In other cases, however, cut scores may be used 
to classify examinees into distinct categories 
(e.g., diagnostic categories, or passing versus 
failing) for which there are no preestablished 
quotas. In these cases, the standard-setting 
method must be clearly documented. Ideally, 
the role of cut scores in test use and interpre- 
tation is taken into account during test design. 
Adequate precision in regions of score scales 
where cut points are established is prerequisite 
to reliable classification of examinees into cat- 
egories. If standard setting employs data on the 
score distributions for criterion groups or on 
the relation of test scores to one or more criteri- 
on variables, those data should be summarized 
in technical documentation. If a judgmental 
standard-setting process is followed, the method 
employed should be clearly described, and the 
precise nature of the judgments called for should 
be presented, whether those are judgments of 
persons, of item or test performances, or of 
other criterion performances predicted by test 
scores. Documentation should also include the 
selection and qualification of judges, training 
provided, any feedback to judges concerning 
the implications of their provisional judgments, 
and any opportunities for judges to confer with 
one another. Where applicable, variability over 
judges should be reported. Whenever feasible, an 
estimate should be provided of the amount of 
variation in cut scores that might be expected if 
the standard-setting procedure were replicated. 
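
The reporting of variability over judges can be sketched as follows in Python. The sketch assumes that each judge supplies one recommended cut score, that the panel's cut score is the mean of those recommendations, and that judges can be treated as an independent sample, so that the standard deviation divided by the square root of the number of judges approximates the variation expected on replication with a new panel. The function name and the ratings are hypothetical, and this estimate does not capture other sources of variation, such as a different method or different occasions.

```python
import math
import statistics


def summarize_cut_scores(judge_cuts):
    """Summarize a panel's recommended cut scores."""
    if len(judge_cuts) < 2:
        raise ValueError("need at least two judges to estimate variability")
    mean_cut = statistics.mean(judge_cuts)
    sd = statistics.stdev(judge_cuts)      # variability over judges
    se = sd / math.sqrt(len(judge_cuts))   # rough replication error of the mean
    return {"cut": mean_cut, "sd_over_judges": sd, "se_of_cut": se}


# Hypothetical recommendations from a five-judge panel.
summary = summarize_cut_scores([62, 65, 60, 68, 65])
```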

Standard 4.20 

When feasible, cut scores defining categories 
with distinct substantive interpretations 
should be established on the basis of sound 
empirical data concerning the relation of test 
performance to relevant criteria. 

Comment: In employment settings, although 
it is important to establish that test scores are 
related to job performance, the precise rela- 
tion of test and criterion may have little bear- 
ing on the choice of a cut score. However, in 
contexts where distinct interpretations are 
applied to different score categories, the 
empirical relation of test to criterion assumes 
greater importance. Cut scores used in inter- 
preting diagnostic tests may be established on 
the basis of empirically determined score dis- 
tributions for criterion groups. With achieve- 
ment or proficiency tests, such as those used 
in licensure, suitable criterion groups (e.g., 
successful versus unsuccessful practitioners) 
are often unavailable. Nonetheless, it is highly 
desirable, when appropriate and feasible, to 
investigate the relation between test scores 
and performance in relevant practical settings. 
Note that a carefully designed and imple- 
mented procedure based solely on judgments 
of content relevance and item difficulty may 
be preferable to an empirical study with an 
inadequate criterion measure or other defi- 
ciencies. Professional judgment is required 
to determine an appropriate standard-setting 
approach (or combination of approaches) in 
any given situation. In general, one would 
not expect to find a sharp difference in levels 
of the criterion variable between those just 


below versus just above the cut score, but evi- 
dence should be provided where feasible of a 
relationship between test and criterion per- 
formance over a score interval that includes 
or approaches the cut score. 

Standard 4.21 

When cut scores defining pass-fail or profi- 
ciency categories are based on direct judg- 
ments about the adequacy of item or test 
performances or performance levels, the 
judgmental process should be designed so 
that judges can bring their knowledge and 
experience to bear in a reasonable way. 

Comment: Cut scores are sometimes based on 
judgments about the adequacy of item or test 
performances (e.g., essay responses to a writ- 
ing prompt) or performance levels (e.g., the 
level that would characterize a borderline 
examinee). The procedures used to elicit such 
judgments should result in reasonable, defensi- 
ble standards that accurately reflect the judges’ 
values and intentions. Reaching such judgments 
may be most straightforward when judges are 
asked to consider kinds of performances with 
which they are familiar and for which they 
have formed clear conceptions of adequacy or 
quality. When the responses elicited by a test 
neither sample nor closely simulate the use of 
tested knowledge or skills in the actual criteri- 
on domain, judges are not likely to approach 
the task with such clear understandings. Special 
care must then be taken to assure that judges 
have a sound basis for making the judgments 
requested. Thorough familiarity with descrip- 
tions of different proficiency categories, prac- 
tice in judging task difficulty with feedback 
on accuracy, the experience of actually taking 
a form of the test, feedback on the failure 
rates entailed by provisional standards, and 
other forms of information may be beneficial 
in helping judges to reach sound and princi- 
pled decisions. 

Background 

The usefulness and interpretability of test 
scores require that a test be administered and 
scored according to the developer’s instruc- 
tions. When directions to examinees, testing 
conditions, and scoring procedures follow the 
same detailed procedures, the test is said to be 
standardized. Without such standardization, 
the accuracy and comparability of score inter- 
pretations would be reduced. For tests designed 
to assess the examinee’s knowledge, skills, or 
abilities, standardization helps to ensure that 
all examinees have the same opportunity to 
demonstrate their competencies. Maintaining 
test security also helps to ensure that no one 
has an unfair advantage. 

Occasionally, however, situations arise in 
which modifications of standardized procedures 
may be advisable or legally mandated. Persons 
of different backgrounds, ages, or familiarity 
with testing may need nonstandard modes of 
test administration or a more comprehensive 
orientation to the testing process, in order that 
all test takers can come to the same under- 
standing of the task. Standardized modes of 
presenting information or of responding may 
not be suitable for specific individuals, such 
as persons with some kinds of disability, or 
persons with limited proficiency in the language 
of the test, so that accommodations may be 
needed (see chapters 9 and 10). Large-scale 
testing programs generally have established 
specific procedures to be used in considering 
and granting accommodations. Some test users 
feel that any accommodation not specifically 
required by law could lead to a charge of 
unfair treatment and discrimination. Although 
accommodations are made with the intent of 
maintaining score comparability, the extent 
to which that is possible may not be known. 
Comparability of scores may be compromised, 
and the test may then not measure the same 
constructs for all test takers. 


Tests and assessments differ in their degree 
of standardization. In many instances different 
examinees are given not the same test form, but 
equivalent forms that have been shown to yield 
comparable scores. Some assessments permit 
examinees to choose which tasks to perform or 
which pieces of their work are to be evaluated. 
A degree of standardization can be maintained 
by specifying the conditions of the choice and 
the criteria of evaluation of the products. When 
an assessment permits a certain kind of collabo- 
ration, the limits of that collaboration can be 
specified. With some assessments, test adminis- 
trators may be expected to tailor their instruc- 
tions to help assure that all examinees understand 
what is expected of them. In all such cases, the 
goal remains the same: to provide accurate and 
comparable measurement for everyone, and 
unfair advantage to no one. The degree of 
standardization is dictated by that goal, and 
by the intended use of the test. 

Standardized directions to test takers 
help to ensure that all test takers understand 
the mechanics of test taking. Directions gen- 
erally inform test takers how to make their 
responses, what kind of help they may legiti- 
mately be given if they do not understand 
the question or task, how they can correct 
inadvertent responses, and the nature of any 
time constraints. General advice is some- 
times given about omitting item responses. 
Many tests, including computer-administered 
tests, require special equipment. Practice exer- 
cises are often presented in such cases to ensure 
that the test taker understands how to operate 
the equipment. The principle of standardiza- 
tion includes orienting test takers to materials 
with which they may not be familiar. Some 
equipment may be provided at the testing site, 
such as shop tools or balances. Opportunity 
for test takers to practice with the equipment 
will often be appropriate, unless using the 
equipment is the purpose of the test. 

Tests are sometimes administered by 
computer, with test responses made by key- 
board, computer mouse, or similar device. 
Although many test takers are accustomed 
to computers, some are not and may need 
some brief explanation. Even those test tak- 
ers who use computers will need to know 
about some details. Special issues arise in 
managing the testing environment, such as 
the arrangement of illumination so that 
light sources do not reflect on the computer 
screen, possibly interfering with display leg- 
ibility. Maintaining a quiet environment 
can be challenging when candidates are test- 
ed separately, starting at different times and 
finishing at different times from neighbor- 
ing test takers. Those who administer com- 
puter-based tests require training in the 
hardware and software used for the test, so 
that they can deal with problems that may 
arise in human-computer interactions. 

Standardized scoring procedures help 
to ensure accurate scoring and reporting, 
which are essential in all circumstances. When 
scoring is done by machine, the accuracy of 
the machine is at issue, including any scoring 
algorithm. When scoring is done by human 
judges, scorers require careful training. Regular 
monitoring can also help to ensure that every 
test protocol is scored according to the same 
standardized criteria and that the criteria do 
not change as the test scorers progress through 
the submitted test responses. 

Test scores, per se, are not readily inter- 
preted without other information, such as 
norms or standards, indications of measure- 
ment error, and descriptions of test content. 
Just as a temperature of 50° in January is 
warm for Minnesota and cool for Florida, a 
test score of 50 is not meaningful without 
some context. When the scores are to be 
reported to persons who are not technical 
specialists, interpretive material can be pro- 
vided that is readily understandable to those 
receiving the report. Often, the test user 


provides an interpretation of the results for 
the test taker, suggesting the limitations of 
the results and the relationship of any reported 
scores to other information. Scores on some 
tests are not designed to be released to test 
takers; only broad test interpretations, or 
dichotomous classifications, such as pass/fail, 
are intended to be reported. 

Interpretations of test results are some- 
times prepared by computer systems. Such 
interpretations are generally based on a com- 
bination of empirical data and expert judg- 
ment and experience. In some professional 
applications of individualized testing, the 
computer-prepared interpretations are com- 
municated by a professional, possibly with 
modifications for special circumstances. 
Such test interpretations require validation. 
Consistency with interpretations provided by 
nonalgorithmic approaches is clearly a concern. 

In some large-scale assessments, the pri- 
mary target of assessment is not the individ- 
ual test taker but is a larger unit, such as a 
school district or an industrial plant. Often, 
different test takers are given different sets 
of items, following a carefully balanced matrix 
sampling plan, to broaden the range of infor- 
mation that can be obtained in a reasonable 
time period. The results acquire meaning 
when aggregated over many individuals taking 
different samples of items. Such assessments 
may not furnish enough information to sup- 
port even minimally valid, reliable scores for 
individuals, as each individual may take only 
an incomplete test. 

Some further issues of administration 
and scoring are discussed in chapter 3, “Test 
Development and Revision.” 

Standard 5.1 

Test administrators should follow carefully 
the standardized procedures for administra- 
tion and scoring specified by the test devel- 
oper, unless the situation or a test taker’s 
disability dictates that an exception should 
be made. 

Comment: Specifications regarding instruc- 
tions to test takers, time limits, the form of 
item presentation or response, and test mate- 
rials or equipment should be strictly observed. 
In general, the same procedures should be 
followed as were used when obtaining the 
data for scaling and norming the test scores. 
A test taker with a disabling condition may 
require special accommodation. Other special 
circumstances may require some flexibility in 
administration. Judgments of the suitability 
of adjustments should be tempered by the 
consideration that departures from standard 
procedures may jeopardize the validity of the 
test score interpretations. 

Standard 5.2 

Modifications or disruptions of standardized 
test administration procedures or scoring 
should be documented. 

Comment: Information about the nature of 
modifications of administration should be 
maintained in secure data files, so that research 
studies or case reviews based on test records 
can take this into account. This includes not 
only special accommodations for particular 
test takers, but also disruptions in the testing 
environment that may affect all test takers in 
the testing session. A researcher may wish to 
use only the records based on standardized 
administration. In other cases, research stud- 
ies may depend on such information to form 
groups of respondents. Test users or test spon- 
sors should establish policies concerning who 
keeps the files and who may have access to 
the files. Whether the information about 


modifications is reported to users of test data, 
such as admissions officers, depends on dif- 
ferent considerations (see chapters 8 and 10). 
If such reports are made, certain cautions may 
be appropriate. 

Standard 5.3 

When formal procedures have been estab- 
lished for requesting and receiving accom- 
modations, test takers should be informed 
of these procedures in advance of testing. 

Comment: When large-scale testing programs 
have established strict procedures to be fol- 
lowed, administrators should not depart from 
these procedures. 

Standard 5.4 

The testing environment should furnish rea- 
sonable comfort with minimal distractions. 

Comment: Noise, disruption in the testing 
area, extremes of temperature, poor lighting, 
inadequate work space, illegible materials, 
and so forth are among the conditions that 
should be avoided in testing situations. The 
testing site should be readily accessible. 
Testing sessions should be monitored where 
appropriate to assist the test taker when a 
need arises and to maintain proper adminis- 
trative procedures. In general, the testing 
conditions should be equivalent to those that 
prevailed when norms and other interpreta- 
tive data were obtained. 

Standard 5.5 

Instructions to test takers should clearly 
indicate how to make responses. Instructions 
should also be given in the use of any equip- 
ment likely to be unfamiliar to test takers. 
Opportunity to practice responding should 
be given when equipment is involved, unless 
use of the equipment is being assessed. 

Comment: When electronic calculators are pro- 
vided for use, examinees may need practice in 
using the calculator. Examinees may need 
practice responding with unfamiliar tasks, such 
as a numeric grid, which is sometimes used with 
mathematics performance items. In computer- 
administered tests, the method of responding 
may be unfamiliar to some test takers. Where 
possible, the practice responses should be mon- 
itored to ensure that the test taker is making 
acceptable responses. In some performance tests 
that involve tools or equipment, instructions may 
be needed for unfamiliar tools, unless accommo- 
dating to unfamiliar tools is part of what is being 
assessed. If a test taker is unable to use the equip- 
ment or make the responses, it may be appropri- 
ate to consider alternative testing modes. 

Standard 5.6 

Reasonable efforts should be made to assure 
the integrity of test scores by eliminating 
opportunities for test takers to attain scores 
by fraudulent means. 

Comment: In large-scale testing programs where 
the results may be viewed as having important 
consequences, efforts to assure score integrity 
should include, when appropriate and practi- 
cable, stipulating requirements for identifica- 
tion, constructing seating charts, assigning 
test takers to seats, requiring appropriate space 
between seats, and providing continuous 
monitoring of the testing process. Test devel- 
opers should design test materials and proce- 
dures to minimize the possibility of cheating. 
Test administrators should note and report 
any significant instances of testing irregularity. 

A local change in the date or time of testing 
may offer an opportunity for fraud. In gener- 
al, steps should be taken to minimize the pos- 
sibility of breaches in test security. In any 
evaluation of work products (e.g., portfolios) 
steps should be taken to ensure that the prod- 
uct represents the candidate’s own work, and 
that the amount and kind of assistance pro- 
vided should be consistent with the intent of 
the assessment. Ancillary documentation, 
such as the date when the work was done, 
may be useful. 

Standard 5.7 


Test users have the responsibility of protect- 
ing the security of test materials at all times. 

Comment: Those who have test materials 
under their control should, with due consid- 
eration of ethical and legal requirements, take 
all steps necessary to assure that only individ- 
uals with a legitimate need for access to test 
materials are able to obtain such access before 
the test administration, and afterwards as 
well, if any part of the test will be reused at a 
later time. Test users must balance test securi- 
ty with the rights of all test takers and test 
users. When sensitive test documents are 
challenged, it may be appropriate to employ 
an independent third party, using a closely 
supervised secure procedure to conduct a 
review of the relevant materials. Such secure 
procedures are usually preferable to placing 
tests, manuals, and an examinee’s test respons- 
es in the public record. 

Standard 5.8 


Test scoring services should document the 
procedures that were followed to assure 
accuracy of scoring. The frequency of scor- 
ing errors should be monitored and reported 
to users of the service on reasonable request. 
Any systematic source of scoring errors 
should be corrected. 

Comment: Clerical and mechanical errors 
should be examined. Scoring errors should 
be minimized and, when they are found, 
steps should be taken promptly to minimize 
their recurrence. 

Standard 5.9 


When test scoring involves human judgment, 
scoring rubrics should specify criteria for scor- 
ing. Adherence to established scoring criteria 
should be monitored and checked regularly. 
Monitoring procedures should be documented. 

Comment: Human scorers may be provided 
with scoring rubrics listing acceptable alterna- 
tive responses, as well as general criteria. 
Consistency of scoring is often checked by 
rescoring randomly selected test responses 
and by rescoring some responses from earlier 
administrations. Periodic checks of the statis- 
tical properties (e.g., means, standard devia- 
tions) of scores assigned by individual scorers 
during a scoring session can provide feedback 
for the scorers, helping them to maintain 
scoring standards. Lack of consistent scoring 
may call for retraining or dismissing some scor- 
ers or for reexamining the scoring rubrics. 
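
The two checks mentioned in this comment, per-scorer statistical summaries and rescoring of selected responses, can be sketched in Python as follows. The record layout (pairs of scorer identifier and assigned score), the function names, and the data are assumptions made for illustration.

```python
import statistics
from collections import defaultdict


def scorer_summaries(records):
    """Return count, mean, and SD of the scores assigned by each scorer.

    records is an iterable of (scorer_id, score) pairs.
    """
    by_scorer = defaultdict(list)
    for scorer, score in records:
        by_scorer[scorer].append(score)
    return {
        scorer: {
            "n": len(scores),
            "mean": statistics.mean(scores),
            "sd": statistics.pstdev(scores),
        }
        for scorer, scores in by_scorer.items()
    }


def exact_agreement(first_scores, rescored):
    """Proportion of rescored responses that received the same score again."""
    if len(first_scores) != len(rescored) or not first_scores:
        raise ValueError("need paired, non-empty score lists")
    same = sum(a == b for a, b in zip(first_scores, rescored))
    return same / len(first_scores)


# Hypothetical session data: scorer r2 assigns the same score every time.
records = [("r1", 3), ("r1", 4), ("r1", 5), ("r2", 2), ("r2", 2), ("r2", 2)]
summaries = scorer_summaries(records)
rate = exact_agreement([3, 4, 2, 5], [3, 3, 2, 5])
```

A scorer whose mean drifts from the group, or whose standard deviation collapses toward zero as for r2 here, would be a candidate for feedback or retraining.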

Standard 5.10 

When test score information is released to 
students, parents, legal representatives, teach- 
ers, clients, or the media, those responsible 
for testing programs should provide appro- 
priate interpretations. The interpretations 
should describe in simple language what the 
test covers, what scores mean, the precision 
of the scores, common misinterpretations of 
test scores, and how scores will be used. 

Comment: Test users should consult the inter- 
pretive material prepared by the test developer 
or publisher and should revise or supplement 
the material as necessary to present the local and 
individual results accurately and clearly. Score 
precision might be depicted by error bands, 
or likely score ranges, showing the standard 
error of measurement. 
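
An error band of the kind described can be computed from the classical relationship between the standard error of measurement (SEM), the score standard deviation, and the reliability coefficient: SEM = SD × sqrt(1 − reliability). The Python sketch below is illustrative only; the function names and the values (a standard deviation of 10 and a reliability of 0.91) are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1 - reliability)


def score_band(score, sd, reliability, z=1.0):
    """Return (low, high) limits of a band of z SEMs around a score."""
    half_width = z * standard_error_of_measurement(sd, reliability)
    return score - half_width, score + half_width


# A reported score of 50 on a scale with SD 10 and reliability 0.91.
low, high = score_band(50, sd=10, reliability=0.91)  # roughly 47 to 53
```

A band of one SEM on either side is a common convention; a wider band (for example, z = 1.96) may be preferable when the report is meant to discourage over-interpretation of small score differences.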

Standard 5.11 

When computer-prepared interpretations of 
test response protocols are reported, the 
sources, rationale, and empirical basis for 
these interpretations should be available, 
and their limitations should be described. 


Comment: Whereas computer-prepared inter- 
pretations may be based on expert judgment, 
the interpretations are of necessity based 
on accumulated experience and may not be 
able to take into consideration the context of 
the individual's circumstances. Computer- 
prepared interpretations should be used with 
care in diagnostic settings, because they 
may not take into account other information 
about the individual test taker, such as age, 
gender, education, prior employment, and 
medical history, that provide context for 
test results. 

Standard 5.12 

When group-level information is obtained 
by aggregating the results of partial tests 
taken by individuals, validity and reliability 
should be reported for the level of aggrega- 
tion at which results are reported. Scores 
should not be reported for individuals unless 
the validity, comparability, and reliability of 
such scores have been established. 

Comment: Large-scale assessments often 
achieve efficiency by “matrix sampling” of 
the content domain by asking different test 
takers different questions. The testing then 
requires less time from each test taker, while 
the aggregation of individual results provides 
for domain coverage that can be adequate 
for meaningful group- or program-level 
interpretations, such as schools, or grade 
levels within a locality or particular subject- 
matter areas. Because the individual receives 
only an incomplete test, an individual score 
would have limited meaning. If individual 
scores are provided, comparisons between 
scores obtained by different individuals are 
based on responses to items that may cover 
different material. Some degree of calibra- 
tion among incomplete tests can sometimes 
be made. Such calibration is essential to the 
comparisons of individual scores. 
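
The matrix sampling idea can be sketched as follows. This is an illustrative assumption, not the procedure of any particular assessment program: item blocks are rotated across test takers, and a group-level score is formed as the mean of the per-item proportions correct pooled over everyone who saw each item. All names and data are hypothetical.

```python
from collections import defaultdict


def assign_blocks(taker_ids, n_blocks):
    """Rotate through the blocks so that each test taker receives exactly one."""
    return {taker: i % n_blocks for i, taker in enumerate(taker_ids)}


def group_domain_score(responses):
    """Aggregate pooled (item_id, correct) pairs into a group-level score.

    Returns the mean of the per-item proportions correct together with
    the per-item proportions themselves.
    """
    totals = defaultdict(lambda: [0, 0])  # item_id -> [n_correct, n_attempts]
    for item_id, correct in responses:
        totals[item_id][0] += int(correct)
        totals[item_id][1] += 1
    item_p = {item: c / n for item, (c, n) in totals.items()}
    return sum(item_p.values()) / len(item_p), item_p


# Hypothetical example: four test takers, two blocks of two items each.
assignment = assign_blocks(["t1", "t2", "t3", "t4"], n_blocks=2)
responses = [
    ("A1", True), ("A2", False),   # t1, block 0
    ("B1", True), ("B2", True),    # t2, block 1
    ("A1", True), ("A2", True),    # t3, block 0
    ("B1", False), ("B2", True),   # t4, block 1
]
score, item_p = group_domain_score(responses)
print(assignment, score)
```

No individual in this example answers more than half of the items, which is why the group-level figure covers the domain while any individual score would not.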

Standard 5.13 

Transmission of individually identified test 
scores to authorized individuals or institu- 
tions should be done in a manner that pro- 
tects the confidential nature of the scores. 

Comment: Care is always needed when com- 
municating the scores of identified test takers, 
regardless of the form of communication. 
Face-to-face communication, as well as telephone 
and written communication, presents 
well-known problems. Transmission by electronic 
media, including computer networks 
and facsimile, presents modern challenges 
to confidentiality. 

Standard 5.14 

When a material error is found in test scores 
or other important information released by a 
testing organization or other institution, a 
corrected score report should be distributed 
as soon as practicable to all known recipients 
who might otherwise use the erroneous scores 
as a basis for decision making. The corrected 
report should be labeled as such. 

Comment: A material error is one that could 
change the interpretation of the test score. 
Innocuous typographical errors would be 
excluded. Timeliness is essential for decisions 
that will be made soon after the test scores 
are received. 

Standard 5.15 

When test data about a person are retained, 
both the test protocol and any written 
report should also be preserved in some 
form. Test users should adhere to the poli- 
cies and record-keeping practice of their 
professional organizations. 

Comment: The protocol may be needed to 
respond to a possible challenge from a test 
taker. The protocol would ordinarily be 
accompanied by testing materials and test 
scores. Retention of more detailed records of 
responses would depend on circumstances 
and should be covered in a retention policy 
(see the following standard). Record keeping 
may be subject to legal and professional 
requirements. Policy for the release of any test 
information for other than research purposes 
is discussed in chapter 8. 

Standard 5.16 

Organizations that maintain test scores on 
individuals in data files or in an individual’s 
records should develop a clear set of policy 
guidelines on the duration of retention of an 
individual’s records, and on the availability, 
and use over time, of such data. 

Comment: In some instances, test scores 
become obsolete over time, no longer 
reflecting the current state of the test taker. 
Outdated scores should generally not be used 
or made available, except for research purpos- 
es. In other cases, test scores obtained in past 
years can be useful as, for example, in longi- 
tudinal assessment. The key issue is the valid 
use of the information. Score retention and 
disclosure may be subject to legal and profes- 
sional requirements. 

Background 

The provision of supporting documents for 
tests is the primary means by which test 
developers, publishers, and distributors com- 
municate with test users. These documents 
are evaluated on the basis of their complete- 
ness, accuracy, currency, and clarity and 
should be available to qualified individuals as 
appropriate. A test’s documentation typically 
specifies the nature of the test; its intended 
use; the processes involved in the test’s devel- 
opment; technical information related to 
scoring, interpretation, and evidence of valid- 
ity and reliability; scaling and norming if 
appropriate to the instrument; and guidelines 
for test administration and interpretation. 
The objective of the documentation is to pro- 
vide test users with the information needed to 
make sound judgments about the nature and 
quality of the test, the resulting scores, and 
the interpretations based on the test scores. 
The information may be reported in documents 
such as test manuals, technical manuals, 
user’s guides, specimen sets, examination 
kits, directions for test administrators and 
scorers, or preview materials for test takers. 

Test documentation is most effective if it 
communicates information to multiple user 
groups. To accommodate the breadth of 
training of professionals who use tests, sepa- 
rate documents or sections of documents may 
be written for identifiable categories of users 
such as practitioners, consultants, administra- 
tors, researchers, and educators. For example, 
the test user who administers the tests and 
interprets the results needs interpretive infor- 
mation or guidelines. On the other hand, 
those who are responsible for selecting tests 
need to be able to judge the technical adequa- 
cy of the test. Therefore, some combination 
of technical manuals, user’s guides, test manuals, 
test supplements, examination kits, or 
specimen sets ordinarily is published to provide 
a potential test user or test reviewer with 
sufficient information to evaluate the appro- 
priateness and technical adequacy of the test. 
The types of information presented in these 
documents typically include a description of 
the intended test-taking population, stated 
purpose of the test, test specifications, item 
formats, scoring procedures, and the test 
development process. Technical data, such as 
psychometric indices of the items, reliability 
and validity evidence, normative data, and 
cut scores or configural rules including those 
for computer-generated interpretations of test 
scores also are summarized. 

An essential feature of the documentation 
for every test is a discussion of the known 
appropriate and inappropriate uses and inter- 
pretations of the test scores. The inclusion of 
illustrations of score interpretations, as they 
relate to the test developer’s intended applica- 
tions, also will help users make accurate infer- 
ences on the basis of the test scores. When 
possible, illustrations of improper test uses and 
inappropriate test score interpretations will 
help guard against the misuse of the test. 

Test documents need to include enough 
information to allow test users and reviewers 
to determine the appropriateness of the test 
for its intended purposes. References to other 
materials that provide more details about 
research by the publisher or independent 
investigators should be cited and should be 
readily obtainable by the test user or reviewer. 
This supplemental material can be provided 
in any of a variety of published or unpub- 
lished forms; when demand is likely to be 
low, it may be maintained in archival form, 
including electronic storage. Test documenta- 
tion is useful for all test instruments, includ- 
ing those that are developed exclusively for 
use within a single organization. 

In addition to technical documentation, 
descriptive materials are needed in some set- 
tings to inform examinees and other interested 
parties about the nature and content of the 
test. The amount and type of information 
will depend on the particular test and appli- 
cation. For example, in situations requiring 
informed consent, information should be suf- 
ficient to develop a reasoned judgment. Such 
information should be phrased in nontechni- 
cal language and should be as inclusive as is 
consistent with the use of the test scores. The 
materials may include a general description 
and rationale for the test; sample items or 
complete sample tests; and information about 
conditions of test administration, confiden- 
tiality, and retention of test results. For some 
applications, however, the true nature and 
purpose of a test are purposely hidden or dis- 
guised to prevent faking or response bias. In 
these instances, examinees may be motivated 
to reveal more or less of the characteristics 
intended to be assessed. Under these circum- 
stances, hiding or disguising the true nature 
or purpose of the test is acceptable provided 
this action is consistent with legal principles 
and ethical standards. 

This chapter provides general standards 
for the preparation and publication of test 
documentation. The other chapters contain 
specific standards that will be useful to test 
developers, publishers, and distributors in the 
preparation of materials to be included in a 
test’s documentation. 


Standard 6.1 

Test documents (e.g., test manuals, technical 
manuals, user’s guides, and supplemental 
material) should be made available to prospec- 
tive test users and other qualified persons at 
the time a test is published or released for use. 

Comment: The test developer or publisher 
should judge carefully which information 
should be included in first editions of the test 
manual, technical manual, or user’s guides 
and which information can be provided in 
supplements. For low-volume, unpublished 
tests, the documentation may be relatively brief. 
When the developer is also the user, docu- 
mentation and summaries are still necessary. 

Standard 6.2 

Test documents should be complete, accu- 
rate, and clearly written so that the intended 
reader can readily understand the content. 

Comment: Test documents should provide 
sufficient detail to permit reviewers and 
researchers to judge or replicate important 
analyses published in the test manual. For 
example, reporting correlation matrices in 
the test document may allow the test user 
to judge the data upon which decisions and 
conclusions were based, or describing in 
detail the sample and the nature of any factor 
analyses that were conducted will allow the 
test user to replicate reported studies. 

Standard 6.3 

The rationale for the test, recommended 
uses of the test, support for such uses, and 
information that assists in score interpreta- 
tion should be documented. Where particu- 
lar misuses of a test can be reasonably 
anticipated, cautions against such misuses 
should be specified. 

Comment: Test publishers make every effort 
to caution test users against known misuses of 
tests. However, test publishers are not required 
to anticipate all possible misuses of a test. If 
publishers do know of persistent test misuse 
by a test user, extraordinary educational 
efforts may be appropriate. 

Standard 6.4 

The population for whom the test is intended 
and the test specifications should be docu- 
mented. If applicable, the item pool and scale 
development procedures should be described 
in the relevant test manuals. If normative data 
are provided, the norming population should 
be described in terms of relevant demographic 
variables, and the year(s) in which the data 
were collected should be reported. 

Comment: Known limitations of a test for cer- 
tain populations also should be clearly delin- 
eated in the test documents. In addition, if 
the test is available in more than one language, 
test documents should provide information 
on the translation or adaptation procedures, 
on the demographics of each norming sample, 
and on score interpretation issues for each lan- 
guage into which the test has been translated. 

Standard 6.5 

When statistical descriptions and analyses 
that provide evidence of the reliability of 
scores and the validity of their recommended 
interpretations are available, the information 
should be included in the test’s documenta- 
tion. When relevant for test interpretation, 
test documents ordinarily should include 
item-level information, cut scores and configural 
rules, information about raw scores 
and derived scores, normative data, the stan- 
dard errors of measurement, and a descrip- 
tion of the procedures used to equate 
multiple forms. 

Standard 6.6 

When a test relates to a course of training or 
study, a curriculum, a textbook, or packaged 
instruction, the documentation should include 
an identification and description of the course 
or instructional materials and should indicate 
the year in which these materials were prepared. 

Standard 6.7 

Test documents should specify qualifications 
that are required to administer a test and to 
interpret the test scores accurately. 

Comment: Statements of user qualifications 
need to specify the training, certification, 
competencies, or experience needed to have 
access to a test. 

Standard 6.8 

If a test is designed to be scored or interpre- 
ted by test takers, the publisher and test 
developer should provide evidence that the 
test can be accurately scored or interpreted 
by the test takers. Tests that are designed to 
be scored and interpreted by the test taker 
should be accompanied by interpretive 
materials that assist the individual in under- 
standing the test scores and that are written 
in language that the test taker can understand. 

Standard 6.9 

Test documents should cite a representative 
set of the available studies pertaining to gen- 
eral and specific uses of the test. 

Comment: Summaries of cited studies — exclud- 
ing published works, dissertations, or propri- 
etary documents — should be made available 
on request to test users and researchers by the 
publisher. 

Standard 6.10 

Interpretive materials for tests, that include 
case studies, should provide examples illus- 
trating the diversity of prospective test takers. 

Comment: For some instruments, the presentation 
of case studies that are intended to 
assist the user in the interpretation of the test 
scores and profiles also will be appropriate for 
inclusion in the test documentation. For 
example, case studies might cite as appropri- 
ate examples of women and men of different 
ages; individuals differing in sexual orienta- 
tion; persons representing various ethnic, cul- 
tural, or racial groups; and individuals with 
special needs. The inclusion of examples illus- 
trating the diversity of prospective test takers 
is not intended to promote interpretation of 
test scores in a manner inconsistent with legal 
requirements that may restrict certain practices 
in some contexts, such as employee selection. 

Standard 6.11 

If a test is designed so that more than one 
method can be used for administration or 
for recording responses — such as marking 
responses in a test booklet, on a separate 
answer sheet, or on a computer keyboard — 
then the manual should clearly document the 
extent to which scores arising from these 
methods are interchangeable. If the results 
are not interchangeable, this fact should be 
reported, and guidance should be given for 
the interpretation of scores obtained under 
the various conditions or methods of 
administration. 

Standard 6.12 

Publishers and scoring services that offer 
computer-generated interpretations of test 
scores should provide a summary of the evi- 
dence supporting the interpretations given. 

Comment: The test user should be informed 
of any cut scores or configural rules necessary 
for understanding computer-generated score 
interpretations. A description of both the sam- 
ples used to derive cut scores or configural rules 
and the methods used to derive the cut scores 
should be provided. When proprietary inter- 
ests result in the withholding of cut scores or 
configural rules, the owners of the intellectual 
property are responsible for documenting evidence 
in support of the validity of computer-generated 
score interpretations. Such evidence 
might be provided, for example, by reporting 
the finding of an independent review of the 
algorithms by qualified professionals. 

Standard 6.13 

When substantial changes are made to a 
test, the test’s documentation should be 
amended, supplemented, or revised to keep 
information for users current and to provide 
useful additional information or cautions. 

Standard 6.14 

Every test form and supporting document 
should carry a copyright date or publication 
date. 

Comment: During the operational life of a test, 
new or revised test forms may be published, 
and manuals and other materials may be 
added or revised. Users and potential users 
are entitled to know the publication dates of 
various documents that include test norms. 
Communication among researchers is ham- 
pered when the particular test documents 
used in experimental studies are ambiguously 
referenced in research reports. 

Standard 6.15 

Test developers, publishers, and distributors 
should provide general information for test 
users and researchers who may be required 
to determine the appropriateness of an 
intended test use in a specific context. When 
a particular test use cannot be justified, the 
response to an inquiry from a prospective test 
user should indicate this fact clearly. General 
information also should be provided for test 
takers and legal guardians who must provide 
consent prior to a test’s administration. 



PART II 

Fairness 
in Testing 



7. FAIRNESS IN TESTING AND 
TEST USE 


Background 

This chapter addresses overriding issues of 
fairness in testing. It is intended both to 
emphasize the importance of fairness in all 
aspects of testing and assessment and to serve 
as a context for the technical standards. Later 
chapters address in greater detail some fairness 
issues involving the responsibilities of test 
users, the rights and responsibilities of test 
takers, the testing of individuals of diverse lin- 
guistic backgrounds, and the testing of those 
with disabilities. Chapters 12 through 15 also 
address some fairness issues specific to psycho- 
logical, educational, employment and creden- 
tialing, and program evaluation applications 
of testing and assessment. 

Concern for fairness in testing is perva- 
sive, and the treatment accorded the topic 
here cannot do justice to the complex issues 
involved. A full consideration of fairness 
would explore the many functions of testing 
in relation to its many goals, including the 
broad goal of achieving equality of opportu- 
nity in our society. It would consider the 
technical properties of tests, the ways test 
results are reported, and the factors that are 
validly or erroneously thought to account 
for patterns of test performance for groups 
and individuals. A comprehensive analysis 
would also examine the regulations, statutes, 
and case law that govern test use and the 
remedies for harmful practices. The Standards 
cannot hope to deal adequately with all these 
broad issues, some of which have occasioned 
sharp disagreement among specialists and 
other thoughtful observers. Rather, the focus 
of the Standards is on those aspects of tests, 
testing, and test use that are the customary 
responsibilities of those who make, use, 
and interpret tests, and that are character- 
ized by some measure of professional and 
technical consensus. 


Absolute fairness to every examinee is 
impossible to attain, if for no other reasons 
than the facts that tests have imperfect relia- 
bility and that validity in any particular con- 
text is a matter of degree. But neither is any 
alternative selection or evaluation mechanism 
perfectly fair. Properly designed and used, 
tests can and do further societal goals of fair- 
ness and equality of opportunity. Serious 
technical deficiencies in test design, use, or 
interpretation should, of course, be addressed, 
but the fairness of testing in any given con- 
text must be judged relative to that of feasible 
test and nontest alternatives. It is general 
practice that large-scale tests are subjected to 
careful review and empirical checks to mini- 
mize bias. The amount of explicit attention to 
fairness in the design of well-made tests com- 
pares favorably to that of many alternative 
selection or evaluation methods. 

It is also crucial to bear in mind that test 
settings are interpersonal. The interaction of 
examiner with examinee should be profes- 
sional, courteous, caring, and respectful. In 
most testing situations, the roles of examiner 
and examinee are sharply unequal in status. A 
professional’s inferences and reports from test 
findings may markedly impact the life of the 
person who is examined. Attention to these 
aspects of test use and interpretation is no less 
important than more technical concerns. 

As is emphasized in professional educa- 
tion and training, users of tests should be 
alert to the possibility that human issues 
involving examiner and examinee may some- 
times affect test fairness. Attention to inter- 
personal issues is always important, perhaps 
especially so when examinees have a disability 
or differ from the examiner in ethnic, racial, 
or religious background; in gender or sexual 
orientation; in socioeconomic status; in age; 
or in other respects that may affect the exam- 
inee-examiner interaction. 

Varying Views of Fairness 

The term fairness is used in many different ways 
and has no single technical meaning. It is pos- 
sible that two individuals may endorse fairness 
in testing as a desirable social goal, yet reach 
quite different conclusions about the fairness 
of a given testing program. Outlined below are 
four principal ways in which the term fairness 
is used. It should be noted, however, that 
many additional interpretations may be found 
in the technical and popular literature. 

The first two characterizations presented 
here relate fairness to absence of bias and to 
equitable treatment of all examinees in the 
testing process. There is broad consensus that 
tests should be free from bias (as defined 
below) and that all examinees should be treated 
fairly in the testing process itself (e.g., 
afforded the same or comparable procedures in 
testing, test scoring, and use of scores). The 
third characterization of test fairness addresses 
the equality of testing outcomes for examinee 
subgroups defined by race, ethnicity, gender, 
disability, or other characteristics. The idea that 
fairness requires equality in overall passing 
rates for different groups has been almost 
entirely repudiated in the professional testing 
literature. A more widely accepted view would 
hold that examinees of equal standing with 
respect to the construct the test is intended to 
measure should on average earn the same test 
score, irrespective of group membership. 
Unfortunately, because examinees’ levels of 
the construct are measured imperfectly, this 
requirement is rarely amenable to direct exami- 
nation. The fourth definition of fairness relates 
to equity in opportunity to learn the material 
covered in an achievement test. There would 
be general agreement that adequate opportuni- 
ty to learn is clearly relevant to some uses and 
interpretations of achievement tests and clearly 
irrelevant to others, although disagreement might 
arise as to the relevance of opportunity to learn 
to test fairness in some specific situations. 


Fairness as Lack of Bias 

Bias is used here as a technical term. It is 
said to arise when deficiencies in a test itself 
or the manner in which it is used result in 
different meanings for scores earned by mem- 
bers of different identifiable subgroups. When 
evidence of such deficiencies is found at the 
level of item response patterns for members 
of different groups, the terms item bias or dif- 
ferential item functioning (DIF) are often used. 
When evidence is found by comparing the 
patterns of association for different groups 
between test scores and other variables, the 
term predictive bias may be used. The concept 
of bias and techniques for its detection are 
discussed below and are also discussed in 
other chapters of the Standards. There is 
general consensus that consideration of bias 
is critical to sound testing practice. 
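
As an illustration of how item-level evidence of this kind might be screened, the sketch below computes the Mantel-Haenszel common odds ratio for a single item. The text above does not name a specific DIF method, so the choice of this statistic is an assumption made for illustration. Test takers are stratified by a matching score, and within each stratum the odds of a correct response are compared between a reference group and a focal group. All names and data are hypothetical.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one item across matching-score strata.

    records: iterable of (group, matching_score, correct) with group in
    {"ref", "focal"}. A value near 1 suggests little DIF; values above 1
    mean the item favors the reference group at matched score levels.
    """
    # matching_score -> counts of [correct, incorrect] for each group
    tables = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})
    for group, score, correct in records:
        tables[score][group][0 if correct else 1] += 1

    num = den = 0.0
    for cells in tables.values():
        a, b = cells["ref"]      # reference: correct, incorrect
        c, d = cells["focal"]    # focal: correct, incorrect
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    if den == 0:
        raise ValueError("odds ratio undefined: no informative strata")
    return num / den


def make(group, score, n_correct, n_incorrect):
    """Hypothetical helper that expands counts into individual records."""
    return [(group, score, True)] * n_correct + [(group, score, False)] * n_incorrect


# Equal odds within the stratum: no indication of DIF.
no_dif = mantel_haenszel_odds_ratio(make("ref", 10, 2, 2) + make("focal", 10, 1, 1))
# Reference group far more likely to answer correctly at the same matched score.
with_dif = mantel_haenszel_odds_ratio(make("ref", 10, 3, 1) + make("focal", 10, 1, 3))
print(no_dif, with_dif)
```

A statistic of this kind only flags items for review. As the chapter stresses, judging whether a flagged item is actually biased requires examining its content and the meaning of the scores.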

Fairness as Equitable Treatment in the Testing 
Process 

There is consensus that just treatment 
throughout the testing process is a necessary 
condition for test fairness. There is also con- 
sensus that fair treatment of all examinees 
requires consideration not only of a test itself, 
but also the context and purpose of testing 
and the manner in which test scores are used. 
A well-designed test is not intrinsically fair or 
unfair, but the use of the test in a particular 
circumstance or with particular examinees 
may be fair or unfair. Unfairness can have 
individual and collective consequences. 

Regardless of the purpose of testing, fair- 
ness requires that all examinees be given a 
comparable opportunity to demonstrate 
their standing on the construct(s) the test is 
intended to measure. Just treatment also 
includes such factors as appropriate testing 
conditions and equal opportunity to become 
familiar with the test format, practice materi- 
als, and so forth. In situations where individ- 
ual or group test results are reported, just 
treatment also implies that such reporting 
should be accurate and fully informative. 

Fairness also requires that all examinees 
be afforded appropriate testing conditions. 
Careful standardization of tests and admin- 
istration conditions generally helps to assure 
that examinees have comparable opportuni- 
ty to demonstrate the abilities or attributes 
to be measured. In some cases, however, 
aspects of the testing process that pose no 
particular challenge for most examinees may 
prevent specific groups or individuals from 
accurately demonstrating their standing 
with respect to the construct of interest 
(e.g., due to disability or language background). 
In some instances, greater comparability 
may be attained if 
standardized procedures are modified. There 
are contexts in which some such modifica- 
tions are forbidden by law and other con- 
texts in which some such modifications are 
required by law. In all cases, standardized 
procedures should be followed for all exam- 
inees unless explicit, documented accommo- 
dations have been made. 

Ideally, examinees would also be afford- 
ed equal opportunity to prepare for a test. 
Examinees should in any case be afforded 
equal access to materials provided by the 
testing organization and sponsor which 
describe the test content and purpose and 
offer specific familiarization and preparation 
for test taking. In addition to assuring equi- 
ty in access to accepted resources for test 
preparation, this principle covers test securi- 
ty for nondisclosed tests. If some examinees 
were to have prior access to the contents of 
a secure test, for example, basing decisions 
upon the relative performance of different 
examinees would be unfair to others who 
did not have such access. On tests that have 
important individual consequences, all exam- 
inees should have a meaningful opportunity 
to provide input to relevant decision makers 
if procedural irregularities in testing are 
alleged, if the validity of the individual’s 
score is challenged or may not be reported, 
or if similar special circumstances arise. 


Finally, the conception of fairness as 
equitable treatment in the testing process 
extends to the reporting of individual and 
group test results. Individual test score infor- 
mation is entitled to confidential treatment in 
most circumstances. Confidentiality should 
be respected; scores should be disclosed only 
as appropriate. When test scores are reported, 
either for groups or individuals, score reports 
should be accurate and informative. It may 
be especially important when reporting 
results to nonprofessional audiences to use 
appropriate language and wording and to 
try to design reports to reduce the likelihood 
of inappropriate interpretations. When group 
achievement differences are reported, for 
example, including additional information to 
help the intended audience understand con- 
founding factors such as unequal educational 
opportunity may help to reduce misinterpre- 
tation of test results and increase the likeli- 
hood that tests will be used wisely. 

Fairness as Equality in Outcomes of Testing 

The idea that fairness requires overall 
passing rates to be comparable across groups 
is not generally accepted in the professional 
literature. Most testing professionals would 
probably agree that while group differences in 
testing outcomes should in many cases trigger 
heightened scrutiny for possible sources of 
test bias, outcome differences across groups 
do not in themselves indicate that a testing 
application is biased or unfair. It might be 
argued that when tests are used for selection, 
persons who all would perform equally well 
on the criterion measure if selected should 
have an equal chance of being chosen regard- 
less of group membership. Unfortunately, 
there is rarely any direct procedure for deter- 
mining whether this ideal has been met. 
Moreover, if score distributions differ from 
one group to another, it is generally impossi- 
ble to satisfy this ideal using any test that has 
a less than perfect correlation with the criteri- 
on measure. 

Many testing professionals would agree 
that if a test is free of bias and examinees 
have received fair treatment in the testing 
process, then the conditions of fairness have 
been met. That is, given evidence of the 
validity of intended test uses and interpreta- 
tions, including evidence of lack of bias and 
attention to issues of fair treatment, fairness 
has been established regardless of group-level 
outcomes. This view need not imply that 
unequal testing outcomes should be ignored 
altogether. They may be important in gener- 
ating new hypotheses about bias and fair 
treatment. But in this view, unequal out- 
comes at the group level have no direct bear- 
ing on questions of test fairness. There may 
be legal requirements to investigate certain 
differences in outcomes of testing among sub- 
groups. Those requirements further may pro- 
vide that, other things being equal, a testing 
alternative that minimizes outcome differ- 
ences across relevant subgroups should be 
used. The standards in this chapter are 
intended to be applied in a manner consistent 
with legal and regulatory standards. 

Fairness as Opportunity to Learn 

This final conception of fairness arises in 
connection with educational achievement test- 
ing. In many contexts, achievement tests are 
intended to assess what a test taker knows or 
can do as a result of formal instruction. When 
some test takers have not had the opportunity 
to learn the subject matter covered by the test 
content, they are likely to get low scores. The 
test score may accurately reflect what the test 
taker knows and can do, but low scores may 
have resulted in part from not having had the 
opportunity to learn the material tested as well 
as from having had the opportunity and having 
failed to learn. When test takers have not had 
the opportunity to learn the material tested, the 
policy of using their test scores as a basis for 
withholding a high school diploma, for exam- 
ple, is viewed as unfair. This issue is further dis- 
cussed in chapter 13, on educational testing. 


At least three important difficulties arise 
with this conception of fairness. First, the 
definition of opportunity to learn is difficult in 
practice, especially at the level of individuals. 
Opportunity is a matter of degree. Moreover, 
the measurement of some important learning 
outcomes may require students to work with 
material they have not seen before. Second, 
even if it is possible to document the topics 
included in the curriculum for a group of stu- 
dents, specific content coverage for any one 
student may be impossible to determine. 
Finally, there is a well-founded desire to 
assure that credentials attest to certain profi- 
ciencies or capabilities. Granting a diploma to 
a low-scoring examinee on the grounds that 
the student had insufficient opportunity to 
learn the material tested means certificating 
someone who has not attained the degree of 
proficiency the diploma is intended to signify. 

It should be noted that opportunity to 
learn ordinarily plays no role in determining 
the fairness of tests used for employment and 
credentialing, which are covered in chapter 
14, nor of admissions testing. In those cir- 
cumstances, it is deemed fair that the test 
should cover the full range of requisite 
knowledge and skills. However, there are situ- 
ations in which the agency that determines 
the contents of a test used for employment or 
credentialing also sets the curriculum that 
must be followed in preparing to take the 
test. In such cases, it is the responsibility of 
that agency to assure that what is to be tested 
is fully included in the specification of what 
is to be taught. 

Bias Associated With Test Content 
and Response Processes 

The term bias in tests and testing refers to 
construct-irrelevant components that result 
in systematically lower or higher scores for 
identifiable groups of examinees. Such con- 
struct-irrelevant score components may be 
introduced due to inappropriate sampling of 
test content or lack of clarity in test instruc-
tions. They may also arise if scoring criteria
fail to credit fully some correct problem
approaches or solutions that are more typi- 
cal of one group than another. Evidence of 
these potential sources of bias may be 
sought in the content of the tests, in com- 
parisons of the internal structure of test 
responses for different groups, and in com- 
parisons of the relationships of test scores 
to other measures, although none of these 
types of evidence is unequivocal. 

Content-Related Sources of Test Bias 

Bias due to inappropriate selection of 
test content may sometimes be detected by 
inspection of the test itself. In some testing 
contexts, it is common for test developers to
engage an independent panel of diverse 
experts to review test content for language 
that might be interpreted differently by mem- 
bers of different groups and for material that 
might be offensive or emotionally disturbing 
to some test takers. For performance assess- 
ments, panels are often engaged to review 
the scoring rubric as well. A test intended to 
measure verbal analogical reasoning, for 
example, should include words in general use, 
not words and expressions associated with 
particular disciplines, occupations, ethnic 
groups, or locations. Where material likely
to be differentially interesting or relevant to 
some examinees is included, it may be bal- 
anced by material that may be of particular 
interest to the remaining examinees. 

In educational achievement testing, 
alignment with curriculum may bear on ques- 
tions of content-related test bias. One may 
ask how well a test represents some content 
domain and also whether that domain is 
appropriate given intended score interpreta- 
tions. A test of 19th-century United States 
history might give considerable emphasis to 
the War of 1812, the Mexican War, the Civil 
War, and the Spanish American War. If some 
state’s curriculum framework dealt relatively
lightly with these wars, devoting more atten-
tion instead, say, to social and industrial
developments, then that state’s test takers 
might be relatively disadvantaged. 

Bias may also result from a lack of clarity 
in test instructions or from scoring rubrics 
that credit responses more typical of one 
group than another. For example, cognitive 
ability tests often require test takers to classify 
objects according to an unspecified rule. If a 
given task credits classification on the basis of 
the stimulus objects’ functions, but an identi- 
fiable subgroup of examinees tends to classify 
the objects on the basis of their physical 
appearance, faulty test interpretations are 
likely. Similarly, if the scoring rubric for a 
constructed response item reserves the highest 
score level for those examinees who in fact 
provide more information or elaboration than 
was actually requested, then less test-wise 
examinees who simply follow instructions will 
earn lower scores. In this case, testwiseness 
becomes a construct-irrelevant component 
of test scores. 

Judgmental methods for the review of 
tests and test items are often supplemented by 
statistical procedures for identifying items on 
tests that function differently across identifi- 
able subgroups of examinees. Differential 
item functioning (DIF) is said to exist when 
examinees of equal ability differ on average, 
according to their group membership, in their 
responses to a particular item. If examinees 
from each group are divided into subgroups 
according to the tested ability and subgroups
at the same ability level have unequal proba- 
bilities of answering a given item correctly, 
then there is evidence that that item may not 
be functioning as intended. It may be meas- 
uring something different from the remainder 
of the test or it may be measuring with differ- 
ent levels of precision for different subgroups 
of examinees. Such an item may offer a valid 
measurement of some narrow element of the 
intended construct, or it may tap some con- 
struct-irrelevant component that advantages 
or disadvantages members of one group.
Although DIF procedures may hold some 
promise for improving test quality, there has 
been little progress in identifying the causes 
or substantive themes that characterize items 
exhibiting DIF. That is, once items on a test 
have been statistically identified as function- 
ing differently from one examinee group to 
another, it has been difficult to specify the 
reasons for the differential performance or 
to identify a common deficiency among the 
identified items. 
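
The matching logic described above (compare groups only within the same ability level) can be sketched in code. The text does not name a particular statistic; the Mantel-Haenszel common odds ratio is used here as one widely used choice, and the function name, group labels, and counts are hypothetical illustrations rather than part of the Standards.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one item across matched ability strata.

    records: iterable of (group, stratum, correct), where group is
    "reference" or "focal", stratum is a matched ability level, and
    correct is 1 or 0. A value near 1.0 suggests no DIF; a value above
    1.0 means the item favors the reference group at equal ability.
    """
    # Per-stratum counts: [ref_right, ref_wrong, focal_right, focal_wrong]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, stratum, correct in records:
        offset = 0 if group == "reference" else 2
        tables[stratum][offset + (0 if correct else 1)] += 1

    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in tables.values():
        total = ref_right + ref_wrong + focal_right + focal_wrong
        numerator += ref_right * focal_wrong / total
        denominator += ref_wrong * focal_right / total
    if denominator == 0:
        raise ValueError("odds ratio undefined for these counts")
    return numerator / denominator


def expand(group, stratum, right, wrong):
    """Helper that turns counts into individual (hypothetical) records."""
    return [(group, stratum, 1)] * right + [(group, stratum, 0)] * wrong


# Hypothetical item: at each ability level the focal group succeeds less often.
dif_records = (
    expand("reference", "low", 40, 60) + expand("focal", "low", 20, 80)
    + expand("reference", "high", 80, 20) + expand("focal", "high", 60, 40)
)
# Hypothetical item: groups differ in size but not in success rate.
no_dif_records = (
    expand("reference", "low", 40, 60) + expand("focal", "low", 20, 30)
)

dif_ratio = mantel_haenszel_odds_ratio(dif_records)
no_dif_ratio = mantel_haenszel_odds_ratio(no_dif_records)
```

A ratio that departs from 1.0 flags an item for review; as the text notes, it does not by itself explain why the item behaves differently.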

Response-Related Sources of Test Bias 

In some cases, construct-irrelevant score 
components may arise because test items elic- 
it varieties of responses other than those 
intended or can be solved in ways that were 
not intended. For example, clients responding 
to a diagnostic inventory may attempt to pro- 
vide the answers they think the test adminis- 
trator expects as opposed to the answers that 
best describe themselves. To the extent that 
such response acquiescence is more typical 
of some groups than others, bias may result. 
Bias may also be associated with test response 
formats that pose particular difficulties for 
one group or another. For example, test per- 
formance may rely on some capability (e.g., 
English language proficiency or fine-motor 
coordination) that is irrelevant to the intent 
of the measurement but nonetheless poses 
impediments for some examinees. A test of 
quantitative reasoning that makes inappropri- 
ately heavy demands on verbal ability would 
probably be biased against examinees whose 
first language is other than that of the test. 

In addition to content reviews and DIF 
analyses, evidence of bias related to response 
processes may be provided by comparisons of 
the internal structure of the test responses for 
different groups of examinees. If an analysis 
of the factors or dimensions underlying test 
performance reveals different internal struc-
tures for different groups, it may be that dif-
ferent constructs are being measured or it
may simply be that groups differ in their vari-
ability with respect to the same underlying
dimensions. When there is evidence that 
tests, including personality tests, measure dif- 
ferent constructs in different gender, racial, or 
cultural groups, it is important to determine 
that the internal structure of the test supports 
inferences made for clients from these distinct 
subgroups of the client population. In situa- 
tions where internal test structure varies 
markedly across ethnically diverse cultures, it 
may be inappropriate to make direct compar- 
isons of scores of members of these different 
cultural groups. 

Bias may also be indicated by patterns 
of association between test scores and other 
variables. Perhaps the most familiar form 
such evidence may take is a difference across 
groups in the regression equations relating 
selection test performance to criterion per- 
formance. This case is discussed at greater 
length in the following section. However, 
evidence of bias based on relations to other 
variables may also take many other forms. 
The relationship between two tests of the 
same cognitive ability might be found to dif- 
fer from one group to another, for example. 
Such a difference might indicate bias in one 
or both tests. As another instance, a higher 
than expected association between reading 
and mathematics achievement test scores 
among students who might well have limit- 
ed English proficiency could trigger an 
investigation to determine whether language 
proficiency was influencing some examinees’ 
mathematics scores. Patterns of score aver- 
ages or other distributional summaries might 
also point to potential sources of test bias. If 
males outperformed females on one measure 
of academic performance and, in the same 
population, females outperformed males on 
another, it would follow that the two meas- 
ures could not both be linearly related to the 
identical underlying construct. Note, howev- 
er, that if the tested populations differed, if 
the content domains sampled differed, or if 
the constructs tested otherwise differed due
to varying motivational contexts or other 
effects, two reliable tests, each valid for its 
intended purpose, might show such a pat- 
tern. Association need not imply any direct 
or causal linkage, and alternative explana- 
tions for patterns of association should 
usually be considered. In some cases, a test- 
criterion correlation may arise because the 
test and criterion both depend on the same 
construct-irrelevant ability. If identifiable
subgroups differ with respect to that extra- 
neous ability, then bias may result. 

Fairness in Selection and 
Prediction 

When tests are used for selection and predic- 
tion, evidence of bias or lack of bias is gener- 
ally sought in the relationships between test 
and criterion scores for the respective groups. 
Under one broadly accepted definition, no 
bias exists if the regression equations relating 
the test and the criterion are indistinguishable 
for the groups in question. (Some formula- 
tions may hold that not only regression slopes 
and intercepts but also standard errors of 
estimate must be equal.) If test-criterion 
relationships differ, different decision rules 
may be followed depending on the group 
to which the person belongs. 

If fitting a common prediction equation 
for all groups combined suggests that the cri- 
terion performance of persons in any one 
group is systematically overpredicted or 
underpredicted, and if bias in the criterion 
measure has been set aside as a possible 
explanation, one possibility is to generate a 
separate prediction formula for each group. 
Another possibility is to seek predictor vari- 
ables that may be used in lieu of or in addi- 
tion to the initial predictor score to reduce 
differential prediction without reducing over- 
all predictive accuracy. If separate regression 
equations are employed, the effect of their
use on the distribution of predicted criterion
scores for the different groups should be
examined. Note that in the United States, the 
use of different selection rules for identifiable 
subgroups of examinees is legally proscribed 
in some contexts. There may, however, be 
legal requirements to consider alternative 
selection procedures in some such situations. 
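
As a minimal sketch of how systematic overprediction or underprediction under a common equation might be checked, the following fits one least-squares line to all groups combined and reports the mean residual for each group. The function names and the data are hypothetical; a positive mean residual indicates that the common equation underpredicts that group's criterion performance, and a negative one indicates overprediction.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_residual_by_group(data):
    """data maps a group name to a list of (test_score, criterion_score)."""
    pooled = [pair for pairs in data.values() for pair in pairs]
    intercept, slope = fit_line([x for x, _ in pooled], [y for _, y in pooled])
    result = {}
    for group, pairs in data.items():
        residuals = [y - (intercept + slope * x) for x, y in pairs]
        result[group] = sum(residuals) / len(residuals)
    return result


# Hypothetical data: same slope in both groups, but group A's criterion
# scores run one point higher than group B's at every test score.
scores = [1, 2, 3, 4, 5]
data = {
    "A": [(x, 1.0 + 0.5 * x) for x in scores],
    "B": [(x, 0.5 * x) for x in scores],
}
residuals = mean_residual_by_group(data)
```

In this constructed case the common line splits the difference, so group A is underpredicted and group B is overpredicted by the same amount.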

There is often tension between the per- 
spective that equates fairness with lack of 
bias, in the technical sense, and the perspec- 
tive that focuses on testing outcomes. A test 
that is valid for its intended purpose might be 
considered fair if a given test score predicts 
the same performance level for members of 
all groups. It might nonetheless be regarded 
by some as unfair, however, if average test 
scores differ across groups. This is because a 
given selection score and criterion threshold 
will often result in proportionately more false 
negative decisions in groups with lower mean 
test scores. In other words, a lower-scoring 
group will usually have a higher proportion 
of examinees who are rejected on the basis 
of their test scores even though they would 
have performed successfully if they had been 
selected. This seeming paradox is a statistical 
consequence of the imperfect correlation 
between test and criterion. It does not occur 
because of any other property of the test and 
has no direct relationship to group demo- 
graphics. It is a purely statistical phenomenon 
that occurs as a function of lower test scores, 
regardless of group membership. For exam- 
ple, it usually occurs when the top and bot- 
tom test score halves of the majority group 
are compared. The fairness of a test or 
another predictor should be evaluated rela- 
tive to that of nontest alternatives that 
might be used instead. 
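
The statistical point can be illustrated with a small simulation. This is a sketch under stated assumptions, not a result from the Standards: two hypothetical groups share the same prediction line and differ only in mean test score, and the false negative rate is computed as the share of would-be successful examinees who fall below the test cut, which is one reading of the passage above. The parameter values are arbitrary.

```python
import random


def false_negative_share(test_mean, n=100_000, cut=0.0, threshold=0.0,
                         slope=0.5, seed=0):
    """Share of would-be successful examinees rejected by the test cut.

    Test scores are normal with the given mean and unit variance. The
    criterion follows the same line (criterion = slope * test + noise)
    for every group, so the test is unbiased in the regression sense.
    """
    rng = random.Random(seed)
    noise_sd = (1 - slope ** 2) ** 0.5
    would_succeed = 0
    rejected_anyway = 0
    for _ in range(n):
        test = rng.gauss(test_mean, 1.0)
        criterion = slope * test + rng.gauss(0.0, noise_sd)
        if criterion >= threshold:
            would_succeed += 1
            if test < cut:
                rejected_anyway += 1
    return rejected_anyway / would_succeed


# Same cut, same criterion threshold, same prediction line; only the
# group mean on the test differs (hypothetical values).
higher_group = false_negative_share(test_mean=0.0, seed=1)
lower_group = false_negative_share(test_mean=-1.0, seed=2)
```

Under these assumptions the lower-scoring group shows the larger share, even though group membership never enters the model; only the imperfect test-criterion correlation and the difference in test means are at work.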

Group Outcome Differences Due to Choice of 
Predictors 

Success in virtually all real-world 
endeavors requires multiple skills and abili- 
ties, which may interact in complex ways. 
Testing programs typically address only a 
subset of these. Some skills and abilities are
excluded because they are assessed in other
components of the selection process (e.g., 
completion of course work or an interview); 
others may be excluded because reliable and 
valid measurement is economically, logisti- 
cally, or administratively infeasible. Success 
in college, for example, requires persever- 
ance, motivation, good study habits, and a 
host of other factors in addition to verbal 
and quantitative reasoning ability. Even if 
each of the criteria employed in a selection 
process is demonstrably valid and appropri- 
ate for that purpose, issues of fairness may 
arise in the choice of which factors are 
measured. If identifiable groups differ in 
their average levels of measured versus 
unmeasured job-relevant characteristics, 
then fairness becomes a concern at the 
group level as well as the individual level. 

Can Consensus Be Achieved? 

It is unlikely that consensus in society at 
large or within the measurement communi- 
ty is imminent on all matters of fairness in 
the use of tests. As noted earlier, fairness is 
defined in a variety of ways and is not 
exclusively addressed in technical terms; it is 
subject to different definitions and interpre- 
tations in different social and political cir- 
cumstances. According to one view, the 
conscientious application of an unbiased 
test in any given situation is fair, regardless 
of the consequences for individuals or 
groups. Others would argue that fairness 
requires more than satisfying certain techni- 
cal requirements. It bears repeating that 
while the Standards will provide more spe- 
cific guidance on matters of technical ade- 
quacy, matters of values and public policy 
are crucial to responsible test use. 


Standard 7.1 

When credible research reports that test 
scores differ in meaning across examinee 
subgroups for the type of test in question, 
then to the extent feasible, the same forms 
of validity evidence collected for the exam- 
inee population as a whole should also be 
collected for each relevant subgroup. 
Subgroups may be found to differ with 
respect to appropriateness of test content, 
internal structure of test responses, the 
relation of test scores to other variables, or 
the response processes employed by indi- 
vidual examinees. Any such findings should 
receive due consideration in the interpreta- 
tion and use of scores as well as in subse- 
quent test revisions. 

Comment: Scores differ in meaning across 
subgroups when the same score produces 
systematically different inferences about 
examinees who are members of different 
subgroups. In those circumstances where 
credible research reports differences in score 
meaning for particular subgroups for the type 
of test in question, this standard calls for 
separate, parallel analyses of data for members 
of those subgroups, sample sizes permitting. 
Relevant examinee subgroups may be defined 
by race or ethnicity, culture, language, gender, 
disability, age, socioeconomic status, or other 
classifications. Not all forms of evidence can 
be examined separately for members of all 
such groups. The validity argument may rely 
on existing research literature, for example, 
and such literature may not be available for 
some populations. For some kinds of evi- 
dence, some separate subgroup analyses may 
not be feasible due to the limited number 
of cases available. Data may sometimes be 
accumulated so that these analyses can be 
performed after the test has been in use for a 
period of time. This standard is not satisfied 
by assuring that such groups are represented 
within larger, pooled samples, although this 
may also be important. In giving “due con-
sideration in the interpretation and use of
scores,” pursuant to this standard, test users 
should be mindful of legal restrictions that 
may prohibit or limit within-group scoring 
and other practices. 

Standard 7.2 

When credible research reports differences 
in the effects of construct-irrelevant variance 
across subgroups of test takers on perform- 
ance on some part of the test, the test 
should be used if at all only for those 
subgroups for which evidence indicates 
that valid inferences can be drawn from 
test scores. 

Comment: An obvious reason why a test
may not measure the same constructs across 
subgroups is that different components come 
into play from one subgroup to another. 
Alternatively, an irrelevant component may 
have a more significant effect on the perform-
ance of examinees in one subgroup than in
another. Such intrusive elements are rarely 
entirely absent for any subgroup but are sel- 
dom present to any great extent. The decision 
whether or not to use a test with any given 
examinee subgroup necessarily involves a 
careful analysis of the validity evidence for 
different subgroups, as called for in Standard 
7.1, and the exercise of thoughtful profession-
al judgment regarding the significance of the
irrelevant components. 

A conclusion that a test is not appro- 
priate for a particular subgroup requires 
an alternative course of action. This may 
involve a search for a test that can be used 
for all groups or, in circumstances where it 
is feasible to use different construct-equiva- 
lent tests for different groups, for an alter- 
native test for use in the subgroup for 
which the intended construct is not well 
measured by the current test. In some cases 
multiple tests may be used in combination,
and a composite that permits valid infer-
ences across subgroups may be identified.
In some circumstances, such as employment 
testing, there may be legal or other con- 
straints on the use of different tests for 
different subgroups. 

It is acknowledged that there are 
occasions where examinees may request or 
demand to take a version of the test other 
than that deemed most appropriate by the 
developer or user. An individual with a 
disability may decline an alternate form 
and request the standard form. Acceding 
to this request, after ensuring that the 
examinee is fully informed about the test 
and how it will be used, is not a violation 
of this standard. 

Standard 7.3 

When credible research reports that differ- 
ential item functioning exists across age, 
gender, racial/ethnic, cultural, disability, 
and/or linguistic groups in the population 
of test takers in the content domain meas- 
ured by the test, test developers should 
conduct appropriate studies when feasible. 
Such research should seek to detect and 
eliminate aspects of test design, content, 
and format that might bias test scores for 
particular groups. 

Comment: Differential item functioning 
exists when examinees of equal ability 
differ, on average, according to their group 
membership in their responses to a particu- 
lar item. In some domains, existing research 
may indicate that differential item function- 
ing occurs infrequently and does not repli- 
cate across samples. In others, research 
evidence may indicate that differential item 
functioning occurs reliably at meaningful 
above-chance levels for some particular 
groups; it is to such circumstances that the 
standard applies. Although it may not be 
possible prior to first release of a test to 
study the question of differential item
functioning for some such groups, contin- 
ued operational use of a test may afford 
opportunities to check for differential 
item functioning. 

Standard 7.4 

Test developers should strive to identify 
and eliminate language, symbols, words, 
phrases, and content that are generally 
regarded as offensive by members of racial, 
ethnic, gender, or other groups, except 
when judged to be necessary for adequate 
representation of the domain. 

Comment: Two issues are involved. The first 
deals with the inadvertent use of language 
that, unknown to the test developer, has a 
different meaning or connotation in one 
subgroup than in others. Test publishers 
often conduct sensitivity reviews of all test 
material to detect and remove sensitive 
material from the test. The second deals 
with settings in which sensitive material is 
essential for validity. For example, history 
tests may appropriately include material on 
slavery or Nazis. Tests on subjects from the 
life sciences may appropriately include 
material on evolution. A test of under- 
standing of an organization’s sexual harass- 
ment policy may require employees to 
evaluate examples of potentially offensive 
behavior. 

Standard 7.5 

In testing applications involving individu- 
alized interpretations of test scores other 
than selection, a test taker’s score should 
not be accepted as a reflection of standing 
on the characteristic being assessed with- 
out consideration of alternate explanations 
for the test taker’s performance on that test 
at that time. 


Comment: Many test manuals point out 
variables that should be considered in inter- 
preting test scores, such as clinically relevant 
history, school record, vocational status, and 
test-taker motivation. Influences associated 
with variables such as socioeconomic status, 
ethnicity, gender, cultural background, lan- 
guage, or age may also be relevant. In addi- 
tion, medication, visual impairments, or 
other disabilities may affect a test taker’s 
performance on, for example, a paper-and- 
pencil test of mathematics. 

Standard 7.6 

When empirical studies of differential pre- 
diction of a criterion for members of dif- 
ferent subgroups are conducted, they 
should include regression equations (or 
an appropriate equivalent) computed sepa- 
rately for each group or treatment under 
consideration or an analysis in which the 
group or treatment variables are entered 
as moderator variables. 

Comment: Correlation coefficients provide
inadequate evidence for or against a differ- 
ential prediction hypothesis if groups or 
treatments are found not to be approxi- 
mately equal with respect to both test 
and criterion means and variances. 
Considerations of both regression slopes 
and intercepts are needed. For example, 
despite equal correlations across groups, 
differences in intercepts may be found. 
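
The closing example can be made concrete with a small constructed data set. The sketch below uses hypothetical scores in which two groups have identical test-criterion correlations and slopes, yet the intercepts differ by a full point; the function name and values are illustrative only.

```python
def summarize(pairs):
    """Return (correlation, intercept, slope) for (test, criterion) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    correlation = sxy / (sxx * syy) ** 0.5
    return correlation, mean_y - slope * mean_x, slope


# Hypothetical groups: B repeats A's pattern with every criterion
# score shifted down by one point.
group_a = [(1, 2.0), (2, 2.0), (3, 4.0), (4, 4.0)]
group_b = [(x, y - 1.0) for x, y in group_a]

r_a, intercept_a, slope_a = summarize(group_a)
r_b, intercept_b, slope_b = summarize(group_b)
```

Comparing correlations alone would suggest the groups are interchangeable here; only the separate regression equations reveal that the same test score predicts a different criterion level in each group.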

Standard 7.7 

In testing applications where the level of 
linguistic or reading ability is not part of 
the construct of interest, the linguistic or 
reading demands of the test should be kept 
to the minimum necessary for the valid 
assessment of the intended construct. 

Comment: When the intent is to assess ability
in mathematics or mechanical comprehen- 
sion, for example, the test should not con- 
tain unusual words or complicated syntactic 
conventions unrelated to the mathematical 
or mechanical skill being assessed. 

Standard 7.8 

When scores are disaggregated and pub- 
licly reported for groups identified by 
characteristics such as gender, ethnicity, 
age, language proficiency, or disability, 
cautionary statements should be included 
whenever credible research reports that test 
scores may not have comparable meaning 
across these different groups. 

Comment: Comparisons across groups are
only meaningful if scores have comparable 
meaning across groups. The standard is 
intended as applicable to settings where 
scores are implicitly or explicitly presented as
comparable in score meaning across groups. 

Standard 7.9 

When tests or assessments are proposed 
for use as instruments of social, education- 
al, or public policy, the test developers or 
users proposing the test should fully and 
accurately inform policymakers of the 
characteristics of the tests as well as any 
relevant and credible information that may 
be available concerning the likely conse- 
quences of test use. 

Standard 7.10 

When the use of a test results in outcomes 
that affect the life chances or educational 
opportunities of examinees, evidence of 
mean test score differences between rele- 
vant subgroups of examinees should, 
where feasible, be examined for subgroups 
for which credible research reports mean 
differences for similar tests. Where mean
differences are found, an investigation
should be undertaken to determine that 
such differences are not attributable to a 
source of construct underrepresentation 
or construct-irrelevant variance. While 
initially the responsibility of the test 
developer, the test user bears responsibility 
for uses with groups other than those 
specified by the developer. 

Comment: Examples of such test uses 
include situations in which a test plays a 
dominant role in a decision to grant or 
withhold a high school diploma or to pro- 
mote a student or retain a student in grade. 
Such an investigation might include a 
review of the cumulative research literature 
or local studies, as appropriate. In some 
domains, such as cognitive ability testing 
in employment, a substantial relevant 
research base may preclude the need for 
local studies. In educational settings, as dis- 
cussed in chapter 13, potential differences 
in opportunity to learn may be relevant as 
a possible source of mean differences. 

Standard 7.11 

When a construct can be measured in dif- 
ferent ways that are approximately equal 
in their degree of construct representation 
and freedom from construct-irrelevant 
variance, evidence of mean score differ- 
ences across relevant subgroups of exam- 
inees should be considered in deciding 
which test to use. 

Comment: Mean score differences, while 
important, are but one factor influencing 
the choice between one test and another. 
Cost, testing time, test security, and logistic 
issues (e.g., an application where very large 
numbers of examinees must be screened in 
a very short time) are among the issues also 
entering into the professional judgment 
about test use. 

Standard 7.12

The testing or assessment process should 
be carried out so that test takers receive 
comparable and equitable treatment dur- 
ing all phases of the testing or assessment 
process. 

Comment: For example, should a person 
administering a test or interpreting test 
results recognize a personal bias for or 
against an examinee, or for or against any
subgroup of which the examinee is a mem- 
ber, the person could take a variety of steps 
ranging from seeking a review of test inter- 
pretations from a colleague to withdrawal 
from the testing process. 

Background

This chapter addresses fairness issues unique 
to the interests of the individual test taker. 
Fair treatment of test takers is not only a mat- 
ter of equity, but also promotes the validity 
and reliability of the inferences made from 
the test performance. The standards presented 
in this chapter reflect widely accepted princi- 
ples in the field of measurement. The stan- 
dards address the responsibilities of test takers 
with regard to test security, their access to test 
results, and their rights when irregularities in 
their testing are claimed. Other issues of fair- 
ness are treated in other chapters: general 
principles in chapter 7; the testing of linguis- 
tic minorities in chapter 9; the testing of per- 
sons with disabilities in chapter 10. General 
considerations concerning reports of test 
results are covered in chapter 5. 

Test takers have the right to be assessed 
with tests that meet current professional stan- 
dards, including standards of technical quali- 
ty, fairness, administration, and reporting of 
results. Fair and equitable treatment of test 
takers involves providing, in advance of test- 
ing, information about the nature of the test, 
the intended use of test scores, and the confi- 
dentiality of the results. Test takers, or their 
legal representatives when appropriate, need 
enough information about the test and the 
intended use of test results to reach a compe- 
tent decision about participating in testing. 
In some instances, formal informed consent 
for testing is required by law or by other stan- 
dards of professional practice, such as those 
governing research on human subjects. The 
greater the consequences to the test taker, 
the greater the importance of ensuring that 
the test taker is fully informed about the test 
and voluntarily consents to participate, 
except when testing without consent is per-
mitted by law. If a test is optional, the test
taker has the right to know the consequences
of taking or not taking the test. The test 
taker has the right to acceptable opportuni- 
ties for asking questions or expressing con- 
cerns, and may expect timely responses to 
legitimate questions. 

Where consistent with the purposes 
and nature of the assessment, general infor- 
mation is usually provided about the test’s 
content and purposes. Some programs, in 
the interests of fairness, provide all test tak- 
ers with helpful materials, such as study 
guides, sample questions, or complete sam- 
ple tests, when such information does not 
jeopardize the validity of the results from 
future test administration. Advice may also 
be provided about test-taking strategies, 
including time management, and the advis- 
ability of omitting an item response, when 
it is permitted. Information is made known 
about the availability of special accommoda- 
tions for those who need them. The policy 
on retesting may be stated, in case the test 
taker feels that the present performance 
does not appropriately reflect his/her best 
performance. 

As participants in the assessment, test 
takers have responsibilities as well as rights. 
Their responsibilities include preparing them- 
selves for the test, following the directions of 
the test administrator, representing them- 
selves honestly on the test, and informing 
appropriate persons if they believe the test 
results do not adequately reflect them. In 
group testing situations, test takers are expect- 
ed not to interfere with the performance of 
other test takers. 

Test validity rests on the assumption 
that a test taker has earned fairly a particu- 
lar score or pass/fail decision. Any form of 
cheating, or other behavior that reduces the 
fairness and validity of a test, is irresponsi-
ble, is unfair to other test takers and may
lead to sanctions. It is unfair for a test taker 
to use aids that are prohibited. It is unfair 
for a test taker to arrange for someone else 
to take the test in his/her place. The test taker 
is obligated to respect the copyrights of the 
test publisher or sponsor on all test materials. 
This means that the test taker will not repro- 
duce the items without authorization nor 
disseminate, in any form, material that is 
clearly analogous to the reproduction of the 
items. Test takers, as well as test administra- 
tors, have the responsibility not to compro- 
mise security by divulging any details of the 
test items to others nor may they request 
such details from others. Failure to honor 
these responsibilities may compromise the 
validity of test score interpretations for 
themselves and for others. 

Sometimes, testing programs use special 
scores, statistical indicators, and other 
indirect information about irregularities in 
testing to help ensure that the test scores 
are obtained fairly. Unusual patterns of 
responses, large changes in test scores upon 
retesting, speed of responding, and similar 
indicators may trigger careful scrutiny of 
certain testing protocols. The details of 
these procedures are generally kept secure 
to avoid compromising their use. However, 
test takers can be made aware that in special 
circumstances, such as response or test score 
anomalies, their test responses may get 
special scrutiny. If evidence of impropriety 
or fraud so warrants, the test taker’s score 
may be canceled, or other action taken. 

Because these Standards are directed 
to test providers, and not to test takers, 
standards about test-taker responsibilities 
are phrased in terms of providing informa- 
tion to test takers about their rights and 
responsibilities. Providing this information 
is the joint responsibility of the test devel- 
oper, the test administrator, the test proctor, 
if any, and the test user and may be appor- 
tioned according to particular circumstances. 


Standard 8.1 

Any information about test content and 
purposes that is available to any test taker 
prior to testing should be available to all 
test takers. Important information should 
be available free of charge and in accessi- 
ble formats. 

Comment: The intent of this standard is 
equal treatment for all. Important informa- 
tion would include that necessary for test- 
ing, such as when and where the test is 
given, what material should be brought, 
the purpose of the test, and so forth. More 
detailed information, such as practice mate- 
rials, is sometimes offered for a fee. Such 
offerings should be made to all test takers. 

Standard 8.2 

Where appropriate, test takers should be 
provided, in advance, as much information 
about the test, the testing process, the 
intended test use, test scoring criteria, 
testing policy, and confidentiality protec- 
tion as is consistent with obtaining valid 
responses. 

Comment: Where appropriate, test takers 
should be informed, possibly by a test bul- 
letin or similar procedure, about test con- 
tent, including subject area, topics covered, 
and item formats. They should be informed 
about the advisability of omitting responses. 
They should be aware of any imposed time 
limits, so that they can manage their time 
appropriately. General advice should be 
given about test-taking strategy. In computer 
administrations, they should be told 
about any provisions for review of items 
they have previously answered or omitted. 
Test takers should understand the intended 
use of test scores and the confidentiality of 
test results. They should be advised whether 
they will have access to their results. They 
should be informed about the policy concerning
taking the test again and about
the possibility that some test protocols 
may receive special scrutiny for security 
reasons. Test takers should be informed 
about the consequences of misconduct or 
improper behavior, such as cheating, that 
could result in their being prohibited from 
completing the test, receiving test scores, 
or other sanctions. 

Standard 8.3 

When the test taker is offered a choice of 
test format, information about the charac- 
teristics of each format should be provided. 

Comment: Test takers sometimes have to 
choose between a paper-and-pencil admi- 
nistration and a computer-administered 
test, which may be adaptive. Some tests 
are offered in several different languages. 
Sometimes an alternative assessment is 
offered in lieu of the ordinary test. Test 
takers need to know the characteristics of 
each alternative so that they can make an 
informed choice. 

Standard 8.4 

Informed consent should be obtained from 
test takers, or their legal representatives 
when appropriate, before testing is done 
except (a) when testing without consent 
is mandated by law or governmental regu- 
lation, (b) when testing is conducted as 
a regular part of school activities, or (c) 
when consent is clearly implied. 

Comment: Informed consent implies that 
the test takers or representatives are made 
aware, in language that they can under- 
stand, of the reasons for testing, the type 
of tests to be used, the intended use, and 
the range of material consequences of 
the intended use. If written, video, or
audio records are made of the testing session,
or other records are kept, test takers
are entitled to know what testing information
will be released and to whom. Consent
is not required when testing is legally man- 
dated, such as a court-ordered psychological 
assessment, but there may be legal require- 
ments for providing information. When 
testing is required for employment or for 
educational admissions, applicants, by 
applying, have implicitly given consent to 
the testing. Nevertheless, test takers and/or
their legal representatives should be
given appropriate information about a test 
when it is in their interest to be informed. 
Young test takers should receive an explana- 
tion of the reasons for testing. Even a child 
as young as two or three, as well as older 
test takers of limited cognitive ability, can 
understand a simple explanation as to why 
they are being tested (such as, “I’m going 
to ask you to try to do some things so 
that I can see what you know how to do 
and what things you could use some more
help with”). 

Standard 8.5 

Test results identified by the names of 
individual test takers, or by other perso- 
nally identifying information, should be 
released only to persons with a legitimate, 
professional interest in the test taker or 
who are covered by the informed consent 
of the test taker or a legal representative, 
unless otherwise required by law. 

Comment: Scores of individuals identified 
by name, or by some other means by which 
a person can be readily identified, such as 
social security number, should be kept con- 
fidential. In some situations, information 
may be provided on a confidential basis to 
other practitioners with a legitimate interest 
in the particular case, consistent with legal 
and ethical considerations. Information 
may be provided to researchers if a test 
taker’s anonymity is maintained and the 
intended use is consistent with accepted
research practice and is not inconsistent 
with the conditions of the test taker’s 
informed consent. 

Standard 8.6 

Test data maintained in data files should 
be adequately protected from improper 
disclosure. Use of facsimile transmission, 
computer networks, data banks, and other 
electronic data processing or transmittal 
systems should be restricted to situa- 
tions in which confidentiality can be 
reasonably assured. 

Comment: When facsimile or computer 
communication is used to transmit a test 
protocol to another site for scoring, or if 
scores are similarly transmitted, special pro- 
visions should be made to keep the infor- 
mation confidential. See Standard 5.13. 

Standard 8.7 

Test takers should be made aware that 
having someone else take the test for 
them, disclosing confidential test materi- 
al, or any other form of cheating is inap- 
propriate and that such behavior may 
result in sanctions. 

Comment: Although the standards cannot 
regulate the behavior of test takers, test 
takers should be made aware of their per- 
sonal and legal responsibilities. Arranging 
for someone else to impersonate the nom- 
inal test taker constitutes fraud. Disclosure 
of confidential testing material for the pur- 
pose of giving other test takers pre-knowl- 
edge is unfair and may constitute copyright 
infringement. In licensure and certification 
tests, such actions may compromise public 
health and safety. The validity of test score 
interpretations is compromised by inappro- 
priate test disclosure. 


Standard 8.8 

When score reporting includes assigning 
individuals to categories, the categories 
should be chosen carefully and described 
precisely. The least stigmatizing labels, 
consistent with accurate representation, 
should always be assigned. 

Comment: When labels are associated with 
test results, care should be taken to be pre- 
cise in the meanings associated with the 
labels and to avoid unnecessarily stigmatiz- 
ing consequences associated with a label. 
For example, in an assessment designed to 
aid in determining whether an individual is 
competent to stand trial, the label “incom- 
petent” is appropriate for individuals who 
perform poorly on the assessment. However, 
in a test of basic literacy skills, it is more 
appropriate to use a label such as “not pro- 
ficient” rather than “incompetent,” because 
the latter term has a more global and 
derogatory meaning. 

Standard 8.9 

When test scores are used to make deci- 
sions about a test taker or to make recom- 
mendations to a test taker or a third party, 
the test taker or the legal representative is 
entitled to obtain a copy of any report of 
test scores or test interpretation, unless 
that right has been waived or is prohibited 
by law or court order. 

Comment: In some cases a test taker may be 
adequately informed when the test report is 
given to an appropriate third party (treating 
psychologist or psychiatrist) who can inter- 
pret the findings to the test taker. In profes- 
sional applications of individualized testing, 
when the test taker is given a copy of the 
test report, the examiner or a knowledgeable 
third party should be available to interpret 
it, even if it is clearly written, as the test 
taker may misunderstand or raise questions
not specifically answered in the report. In 
employment testing situations, where test 
results are used solely for the purpose of 
aiding selection decisions, waivers of access 
are often a condition of employment, 
although access to test information may 
often be appropriately required in other 
circumstances. 

Standard 8.10 

In educational testing programs and in 
licensing and certification applications, 
when an individual score report is expected 
to be delayed beyond a brief investigative 
period, because of possible irregularities 
such as suspected misconduct, the test 
taker should be notified, the reason given, 
and reasonable efforts made to expedite 
review and to protect the interests of the 
test taker. The test taker should be noti- 
fied of the disposition, when the investi- 
gation is closed. 

Standard 8.11 

In educational testing programs and in 
licensing and certification applications, 
when it is deemed necessary to cancel or 
withhold a test taker’s score because of possible
testing irregularities, including suspected
misconduct, the type of evidence
and procedures to be used to investigate 
the irregularity should be explained to all 
test takers whose scores are directly affected
by the decision. Test takers should be given 
a timely opportunity to provide evidence 
that the score should not be canceled or 
withheld. Evidence considered in deciding 
upon the final action should be made avail- 
able to the test taker on request. 

Comment: Any form of cheating or behavior 
that reduces the validity and fairness of test 
results should be investigated promptly, and
appropriate action taken. Withholding or
canceling a test score may arise because of 
suspected misconduct by the test taker, or 
because of some anomaly involving others, 
such as theft, or administrative mishap. An 
avenue of appeal should be available and 
made known to candidates whose scores 
may be amended or withheld. Some testing 
organizations offer the option of a prompt 
and free retest or arbitration of disputes. 

Standard 8.12 

In educational testing programs and in 
licensing and certification applications, 
when testing irregularities are suspected, 
reasonably available information bearing 
directly on the assessment should be considered,
consistent with the need to protect
the privacy of test takers.

Comment: Unless allegations of misconduct 
are made by associates of the test taker, the 
information to be collected would ordinari- 
ly be limited to that obtainable without 
invading the privacy of the test taker or 
his/her associates. 

Standard 8.13 

In educational testing programs and in 
licensing and certification applications, 
test takers are entitled to fair considera- 
tion and reasonable process, as appropriate 
to the particular circumstances, in resolv- 
ing disputes about testing. Test takers are 
entitled to be informed of any available 
means of recourse. 

Comment: When a test taker’s score may 
be questioned and may be invalidated, or 
when a test taker seeks a review or revision 
of his/her score or some other aspect of the 
testing, scoring, or reporting process, the 
test taker is entitled to some orderly process 
for effective input into or review of the 
decision making of the test administrator or
test user. Depending upon the magnitude of 
the consequences associated with the test, 
this can range from an internal review of all 
relevant data by a test administrator, to an 
informal conversation with an examinee, to 
a full administrative hearing. The greater 
the consequences, the greater the extent of 
procedural protections that should be made 
available. Test takers should also be made 
aware of procedures for recourse, fees, 
expected time for resolution, and any possi- 
ble consequences for the test taker. Some 
testing programs advise that the test taker 
may be represented by an attorney, although 
possibly at the test taker’s expense. 